
support OOV token in Tokenizer. #5695

Closed
wants to merge 5 commits

Conversation

gewoonrik
Contributor

I have implemented the option to replace OOV words with an OOV token instead of removing them.
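For illustration, a minimal usage sketch. The keyword name oov_token is an assumption for this example; the exact parameter name and behavior are whatever this PR settles on:

from keras.preprocessing.text import Tokenizer

# Hypothetical: `oov_token` is an assumed parameter name for this sketch.
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(['the cat sat on the mat'])

# 'dog' was never seen during fitting, so it maps to the OOV token's
# index instead of being silently dropped from the sequence.
print(tokenizer.texts_to_sequences(['the dog sat']))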

@joelthchao
Contributor

Can you write unit tests for this change? The current unit tests for Tokenizer seem really incomplete.

@gewoonrik
Contributor Author

Sure, no problem!

@gewoonrik changed the base branch from keras-2 to master on March 19, 2017
@gewoonrik
Contributor Author

gewoonrik commented Mar 19, 2017

While writing unit tests, I found an off-by-one bug in the tokenizer:

from keras.preprocessing.text import Tokenizer

texts = ['I am', 'I was']
tokenizer = Tokenizer(num_words=1)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

sequences[0] and sequences[1] should both be equal to [1], but both are empty.

https://github.com/gewoonrik/keras/blob/25eb7cb6871fa73df9cbb805178e188699f140b7/keras/preprocessing/text.py#L206

i = self.word_index.get(w)
if i is not None:
    if num_words and i >= num_words:
        continue

In this case num_words == 1 and i == 1, so the word is skipped.

It should be (because index 0 is reserved for padding):

if num_words and i > num_words:

Do you want me to fix this here or in another PR?

@gewoonrik
Contributor Author

Fixed in 6a78fd7.

assert len(sequences[1]) == 5

assert np.max(np.max(sequences)) == 4
assert np.min(np.min(sequences)) == 1
Contributor


No need to use np.min/np.max twice:

assert np.max(sequences) == 4
assert np.min(sequences) == 1

Contributor

@joelthchao left a comment

LGTM

@gewoonrik
Contributor Author

Ping @fchollet

@uniaz

uniaz commented Apr 6, 2017

I believe this will create bugs in the model, since the actual vocabulary size is now num_words + 1.
E.g. for the Embedding layer, input_dim should be word_vocab_size + 1 instead of word_vocab_size!

@gewoonrik
Contributor Author

gewoonrik commented Apr 6, 2017

That's why it is not the default option.
You set the Embedding input_dim yourself, so you should set it to word_vocab_size + 1.
I tried to communicate that clearly in the documentation.
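To make that concrete, a hedged sketch (not code from this PR; the sizes below are illustrative):

from keras.models import Sequential
from keras.layers import Embedding

word_vocab_size = 1000  # assumed size of the fitted vocabulary

model = Sequential()
# One extra row so the OOV token's index fits inside the embedding table
# (index 0 stays reserved for padding).
model.add(Embedding(input_dim=word_vocab_size + 1, output_dim=64))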

@fchollet
Member

fchollet commented Apr 9, 2017

Maybe we should normalize on the behavior currently used by our text datasets:

start_char=1, oov_char=2, index_from=3

Which is to say that 0 is reserved, 1 is an optional start character, 2 is the OOV character, and words start at 3. Only num_words - 3 words get indexed, so that the highest index is num_words - 1.

We can default to start_char=None, oov_char=None, index_from=1 for backwards compatibility.
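As an illustration, a sketch of the proposed encoding using a hypothetical helper (the function and the word_index layout below are illustrative assumptions, not code from this PR or from Keras):

def encode(words, word_index, num_words, start_char=1, oov_char=2, index_from=3):
    # word_index maps words to 0-based frequency ranks; real words land at
    # indices index_from .. num_words - 1, so only num_words - index_from
    # words get kept, and the highest emitted index is num_words - 1.
    seq = [start_char] if start_char is not None else []
    for w in words:
        i = word_index.get(w)
        if i is None or i + index_from >= num_words:
            if oov_char is not None:
                seq.append(oov_char)
        else:
            seq.append(i + index_from)
    return seq

word_index = {'the': 0, 'cat': 1, 'sat': 2, 'mat': 3}
print(encode(['the', 'cat', 'on'], word_index, num_words=6))
# [1, 3, 4, 2] -> start marker, 'the', 'cat', OOV ('on' was never fitted)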

What does everyone think of the current API, this PR, and my API proposal?

@gewoonrik
Contributor Author

Hmm, what would you use start_char for? To signal the start of the non-padding part?
To be concrete: you propose adding start_char and oov_char as parameters, right? Does that mean that when oov_char is set and start_char=None, index_from=2 and the OOV character is 1?

Regardless of the above: I don't see the benefit of letting people specify the oov_char/start_char values themselves. I would just keep it as use_oov_char=False or use_oov_char=True (or None instead of False; I don't know what the Python best practice is).

@gewoonrik
Contributor Author

@joelthchao do you have an opinion about this?

@fchollet
Member

Regardless of the above: I don't see the benefit of letting people specify the oov_char/start_char values themselves. I would just keep it as use_oov_char=False or use_oov_char=True (or None instead of False; I don't know what the Python best practice is).

Maybe. In any case we would need a unified API, and we would need it to be backwards compatible. If you can make it happen, that's great 👍

@gewoonrik
Contributor Author

gewoonrik commented Apr 25, 2017

Ah, now I understand why you are referring to start_char and oov_char. They are used in the load functions of the example datasets. I didn't know that. I'll take a look at it.
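For context, the built-in IMDB loader already exposes exactly these parameters, with the defaults described above:

from keras.datasets import imdb

# Defaults shown explicitly: 1 marks the start of a review, 2 replaces
# out-of-vocabulary words, and real word indices begin at 3.
(x_train, y_train), (x_test, y_test) = imdb.load_data(
    num_words=10000, start_char=1, oov_char=2, index_from=3)

print(x_train[0][:5])  # every review starts with the start_char, 1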

@gewoonrik
Contributor Author

I'll create a new PR when I've found the time to fix this.

@gewoonrik closed this Jun 15, 2017
@gewoonrik mentioned this pull request Jun 20, 2017