
support OOV token in Tokenizer. #5695

Closed
wants to merge 5 commits

Conversation

gewoonrik
Contributor

I have implemented the option to replace OOV words with an OOV token instead of removing them.
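For illustration, a minimal usage sketch. The keyword name oov_token is an assumption for this example; the exact parameter name and behavior are whatever this PR settles on:

from keras.preprocessing.text import Tokenizer

# Hypothetical: `oov_token` is an assumed parameter name for this sketch.
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(['the cat sat on the mat'])

# 'dog' was never seen during fitting, so it maps to the OOV token's
# index instead of being silently dropped from the sequence.
print(tokenizer.texts_to_sequences(['the dog sat']))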

@joelthchao
Contributor

Can you write unit tests for this change? The current unit tests for Tokenizer seem really incomplete.

@gewoonrik
Contributor Author

Sure, no problem!

@gewoonrik changed the base branch from keras-2 to master on March 19, 2017
@gewoonrik
Contributor Author

gewoonrik commented Mar 19, 2017

While writing unit tests, I found an off-by-one bug in the tokenizer:

from keras.preprocessing.text import Tokenizer

texts = ['I am', 'I was']
tokenizer = Tokenizer(num_words=1)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

sequences[0] and sequences[1] should both be equal to [1], but both are empty.

https://github.com/gewoonrik/keras/blob/25eb7cb6871fa73df9cbb805178e188699f140b7/keras/preprocessing/text.py#L206

i = self.word_index.get(w)
if i is not None:
    if num_words and i >= num_words:
        continue

In this case num_words == 1 and i == 1, so the word is skipped.

It should be (because index 0 is reserved for padding):

if num_words and i > num_words:

Do you want me to fix this here or in another PR?

@gewoonrik
Contributor Author

Fixed in 6a78fd7.

assert len(sequences[1]) == 5

assert np.max(np.max(sequences)) == 4
assert np.min(np.min(sequences)) == 1
Contributor


No need to use np.min/np.max twice:

assert np.max(sequences) == 4
assert np.min(sequences) == 1

Contributor

@joelthchao left a comment

LGTM

@gewoonrik
Contributor Author

Ping @fchollet

@uniaz

uniaz commented Apr 6, 2017

I believe this will create bugs in the model, since the actual vocabulary size is now num_words + 1.
E.g. for the Embedding layer, input_dim should be word_vocab_size + 1 instead of word_vocab_size!

@gewoonrik
Contributor Author

gewoonrik commented Apr 6, 2017

That's why it is not the default option.
You set the Embedding input_dim yourself, so you should set it to word_vocab_size + 1.
I tried to communicate that clearly in the documentation.
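To make that concrete, a hedged sketch (not code from this PR; the sizes below are illustrative):

from keras.models import Sequential
from keras.layers import Embedding

word_vocab_size = 1000  # assumed size of the fitted vocabulary

model = Sequential()
# One extra row so the OOV token's index fits inside the embedding table
# (index 0 stays reserved for padding).
model.add(Embedding(input_dim=word_vocab_size + 1, output_dim=64))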

@fchollet
Member

fchollet commented Apr 9, 2017

Maybe we should normalize on the behavior currently used by our text datasets:

start_char=1, oov_char=2, index_from=3

Which is to say that 0 is reserved, 1 is an optional start character, 2 is the OOV character, and words start at 3. Only num_words - 3 words get indexed, so that the highest index is num_words - 1.

We can default to start_char=None, oov_char=None, index_from=1 for backwards compatibility.
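As an illustration, a sketch of the proposed encoding using a hypothetical helper (the function and the word_index layout below are illustrative assumptions, not code from this PR or from Keras):

def encode(words, word_index, num_words, start_char=1, oov_char=2, index_from=3):
    # word_index maps words to 0-based frequency ranks; real words land at
    # indices index_from .. num_words - 1, so only num_words - index_from
    # words get kept, and the highest emitted index is num_words - 1.
    seq = [start_char] if start_char is not None else []
    for w in words:
        i = word_index.get(w)
        if i is None or i + index_from >= num_words:
            if oov_char is not None:
                seq.append(oov_char)
        else:
            seq.append(i + index_from)
    return seq

word_index = {'the': 0, 'cat': 1, 'sat': 2, 'mat': 3}
print(encode(['the', 'cat', 'on'], word_index, num_words=6))
# [1, 3, 4, 2] -> start marker, 'the', 'cat', OOV ('on' was never fitted)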

What does everyone think of the current API, this PR, and my API proposal?

@gewoonrik
Contributor Author

Hmm, what would you use start_char for? To signal the start of the non-padding part?
To be concrete: you propose adding start_char and oov_char as parameters, right? Does that mean that when oov_char is set and start_char=None, index_from=2 and the OOV character is 1?

Regardless of the above: I don't see the benefit of letting people specify the oov_char/start_char values themselves. I would just keep it as use_oov_char=False or use_oov_char=True (or None instead of False; I don't know what the Python best practice is).

@gewoonrik
Contributor Author

@joelthchao do you have an opinion about this?

@fchollet
Member

Regardless of the above: I don't see the benefit of letting people specify the oov_char/start_char values themselves. I would just keep it as use_oov_char=False or use_oov_char=True (or None instead of False; I don't know what the Python best practice is).

Maybe. In any case we would need a unified API, and we would need it to be backwards compatible. If you can make it happen, that's great 👍

@gewoonrik
Contributor Author

gewoonrik commented Apr 25, 2017

Ah, now I understand why you are referring to start_char and oov_char. They are used in the load functions of the example datasets. I didn't know that. I'll take a look at it.
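For context, the built-in IMDB loader already exposes exactly these parameters, with the defaults described above:

from keras.datasets import imdb

# Defaults shown explicitly: 1 marks the start of a review, 2 replaces
# out-of-vocabulary words, and real word indices begin at 3.
(x_train, y_train), (x_test, y_test) = imdb.load_data(
    num_words=10000, start_char=1, oov_char=2, index_from=3)

print(x_train[0][:5])  # every review starts with the start_char, 1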

@gewoonrik
Contributor Author

I'll create a new PR when I've found the time to fix this.

@gewoonrik closed this Jun 15, 2017
@gewoonrik mentioned this pull request Jun 20, 2017