Skip to content

Conversation

@abuelnasr0
Copy link
Contributor

This PR fixes a bug in wordpiece tokenizer and it addresses this comment #1395 (comment)

@abuelnasr0 abuelnasr0 changed the title Fix lowercase bug Fix lowercase bug in wordpiece tokenizer Apr 2, 2024
Copy link
Member

@mattdangerw mattdangerw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Looks good to me. Let's just add a comment to explain.

@mattdangerw mattdangerw added the kokoro:force-run Runs Tests on GPU label Apr 3, 2024
@kokoro-team kokoro-team removed the kokoro:force-run Runs Tests on GPU label Apr 3, 2024
@abuelnasr0
Copy link
Contributor Author

@mattdangerw last two commits are not necessary. First, I changed it to two stars "**" instead of "६", because "**" is impossible to be in the string so there is no possibillity to build a wrong mask. but I forgot that even if the mask was build wrong, There will be no effect, because tf_text.case_fold_utf8(text) have no effect on "६".

anyways I have a question about this two lines: https://github.com/keras-team/keras-nlp/blob/9ac333581fa55216ecd11e0fc566e12daef1cf82/keras_nlp/tokenizers/word_piece_tokenizer.py#L152-L153

what is the purpose of them?
at the first glance they were not necessary for me. and I removed them, and all tests passed.
We are splitting in cjx_regex and whitespace down, so no need to add white space before and after every cjx char? WDYT?

@mattdangerw
Copy link
Member

what is the purpose of them?
at the first glance they were not necessary for me. and I removed them, and all tests passed.
We are splitting in cjx_regex and whitespace down, so no need to add white space before and after every cjx char? WDYT?

We might not! Here's context #318 and https://github.com/google-research/bert/blob/master/tokenization.py#L251-L262

I'll pull in this fix for now, as we will probably be cutting a release tomorrow, but we can follow up with any additional changes here.

@mattdangerw mattdangerw merged commit 825b192 into keras-team:master Apr 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants