support OOV token in Tokenizer. #5695
Conversation
Can you write a unit test for this change? The current unit test for […]
Sure, no problem!
While writing unit tests, I've found an off-by-one bug in the tokenizer:
sequences[0] and sequences[1] should both be equal to [1], but both are empty.
In this case it should be […] (because index 0 is reserved for padding). Do you want me to fix this here or in another PR?
Fixed in 6a78fd7
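The padding convention described above can be sketched as follows. This is a minimal illustration of the intended behavior, not the actual Keras implementation; the helper names are hypothetical:

```python
# Minimal sketch: word indices start at 1, because index 0 is
# reserved for padding (helper names are hypothetical).
def build_index(words):
    index = {}
    for w in words:
        if w not in index:
            index[w] = len(index) + 1  # 1-based indexing; 0 = padding
    return index

def texts_to_sequences(texts, index):
    # Words missing from the index are simply dropped here.
    return [[index[w] for w in t.split() if w in index] for t in texts]

index = build_index(["hello"])
sequences = texts_to_sequences(["hello", "hello"], index)
print(sequences)  # [[1], [1]] -- both equal to [1], as the bug report expects
```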
assert len(sequences[1]) == 5
assert np.max(np.max(sequences)) == 4
assert np.min(np.min(sequences)) == 1
No need to use np.min/np.max two times:
assert np.max(sequences) == 4
assert np.min(sequences) == 1
LGTM
Ping @fchollet
I believe this will create bugs in the model, since the actual vocabulary size is now […]
That's why it is not the default option.
Maybe we should normalize on the behavior currently used by our text datasets:
Which is to say that 0 is reserved, 1 is an optional start character, 2 is the OOV character, and words start at 3. Only […]. We can default to […]. What does everyone think of the current API, this PR, and my API proposal?
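The index layout proposed above could be sketched like this. This is a hypothetical illustration, not Keras code, and all names are made up:

```python
# Proposed convention: 0 = padding, 1 = optional start token,
# 2 = OOV token, real words begin at index 3.
PAD, START, OOV = 0, 1, 2
FIRST_WORD_INDEX = 3  # word indices are offset past the reserved slots

def encode(text, word_index):
    # Prepend the start token; unknown words map to the OOV index.
    return [START] + [word_index.get(w, OOV) for w in text.split()]

word_index = {"the": 3, "cat": 4}  # indices start at FIRST_WORD_INDEX
print(encode("the dog", word_index))  # [1, 3, 2] -- "dog" is out-of-vocabulary
```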
Hmm, what would you use […]? Regardless of the above: I don't see the benefit in letting people specify the oov_character/start_character themselves. I would just keep it at […]
@joelthchao do you have an opinion about this?
Maybe. In any case we would need a unified API, and we would need it to be backwards compatible. If you can make it happen, that's great 👍
Ah, now I understand why you are referring to […]
I'll create a new PR when I find the time to fix this.
I have implemented the option to replace OOV words with an OOV token instead of removing them.
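A minimal sketch of the two behaviors being compared (dropping OOV words by default vs. mapping them to a dedicated index); the helper and parameter names here are hypothetical, not the actual PR code:

```python
# Two behaviors: drop out-of-vocabulary words (default), or map them
# to a dedicated OOV index when one is provided.
def to_sequence(words, word_index, oov_index=None):
    if oov_index is None:
        # Default: silently drop words missing from the vocabulary.
        return [word_index[w] for w in words if w in word_index]
    # OOV mode: unknown words become the OOV index instead.
    return [word_index.get(w, oov_index) for w in words]

word_index = {"keras": 1, "tokenizer": 2}
words = ["keras", "unknown", "tokenizer"]
print(to_sequence(words, word_index))               # [1, 2]
print(to_sequence(words, word_index, oov_index=3))  # [1, 3, 2]
```

Note that with an OOV index the output sequence keeps its original length, which is why the effective vocabulary size grows by one, as discussed above.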