Add Support for CJK Char Splitting for WordPiece Tokenizer #318
Conversation
lgtm! Just a few comments
@@ -63,8 +54,24 @@
     ]
 )

+# Matches CJK characters.
Let's add a link to the place we got this from in the BERT repo. It's a good reference to track.
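For reference, a sketch of what such a pattern might look like, using the CJK unicode ranges from BERT's tokenization.py (_is_chinese_char); the constant name and exact formatting here are illustrative, not necessarily what the PR uses:

# Sketch only: ranges taken from BERT's tokenization.py; the constant
# name is assumed for illustration.
CJK_REGEX = r"|".join(
    [
        r"[\x{4E00}-\x{9FFF}]",
        r"[\x{3400}-\x{4DBF}]",
        r"[\x{20000}-\x{2A6DF}]",
        r"[\x{2A700}-\x{2B73F}]",
        r"[\x{2B740}-\x{2B81F}]",
        r"[\x{2B820}-\x{2CEAF}]",
        r"[\x{F900}-\x{FAFF}]",
        r"[\x{2F800}-\x{2FA1F}]",
    ]
)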
-def pretokenize(text, lowercase, strip_accents, split):
+def pretokenize(
+    text, lowercase=True, strip_accents=True, split=True, split_on_cjk=True
nit: add a trailing comma so this formats to multiple lines
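For illustration, with a trailing comma after the last argument, a formatter like black would wrap the signature one argument per line, roughly:

# Assumed result of black's "magic trailing comma" formatting.
def pretokenize(
    text,
    lowercase=True,
    strip_accents=True,
    split=True,
    split_on_cjk=True,
):
    ...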
@@ -91,6 +102,8 @@ def pretokenize(text, lowercase, strip_accents, split):
     # Preprocess, lowercase, strip and split input data.
     if text.shape.rank == 0:
         text = tf.expand_dims(text, 0)
+    if split_on_cjk and split:
Wow interesting, it looks like we were already splitting on these. I think in that case we could actually avoid this regex_replace call and just keep two different split regexes: WHITESPACE_AND_PUNCTUATION_REGEX and WHITESPACE_PUNCTUATION_AND_CJK_REGEX.

if split:
    if split_on_cjk:
        split_pattern = WHITESPACE_PUNCTUATION_AND_CJK_REGEX
    else:
        split_pattern = WHITESPACE_AND_PUNCTUATION_REGEX
    text = tf_text.regex_split(...)

Would this work?
Hmmm, yeah. But we'll have to create a third REGEX, namely PUNCTUATION_AND_CJK_REGEX, to pass to keep_delim_regex_pattern. Isn't it easier to just keep tf.regex_replace(), then? What do you think?
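For context, tf_text.regex_split takes both a delimiter pattern and a pattern for delimiters that should be kept as their own tokens, so the two-regex approach would need a call roughly like the one below. PUNCTUATION_AND_CJK_REGEX is the hypothetical third constant being discussed here, not something defined in the PR as written:

import tensorflow_text as tf_text

# Sketch: split on whitespace, punctuation, and CJK characters, but keep
# punctuation and CJK characters as standalone tokens.
text = tf_text.regex_split(
    text,
    delim_regex_pattern=WHITESPACE_PUNCTUATION_AND_CJK_REGEX,
    keep_delim_regex_pattern=PUNCTUATION_AND_CJK_REGEX,
)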
Pushing changes for now, let me know if it's better to revert back to the original.
This looks fine to me. Original is simpler, but we probably should care about efficiency here (and cutting an op should help that).
Cool!
@@ -99,10 +133,16 @@ def pretokenize(text, lowercase, strip_accents, split):
     # Remove the accent marks.
     text = tf.strings.regex_replace(text, r"\p{Mn}", "")
     if split:
+        if split_on_cjk:
+            split_pattern = WHITESPACE_PUNCTUATION_AND_CJK_REGEX
delim_regex_pattern, I guess? Since you're matching the other arg name.
Hehe, I just changed it to keep_split_pattern :P
that works too!
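Putting the thread together, a sketch of how the split branch might end up, with the names settled above (the actual merged code may differ):

if split:
    if split_on_cjk:
        split_pattern = WHITESPACE_PUNCTUATION_AND_CJK_REGEX
        keep_split_pattern = PUNCTUATION_AND_CJK_REGEX
    else:
        split_pattern = WHITESPACE_AND_PUNCTUATION_REGEX
        keep_split_pattern = PUNCTUATION_REGEX
    # keep_split_pattern keeps punctuation (and CJK chars, when enabled)
    # as standalone tokens rather than dropping them as delimiters.
    text = tf_text.regex_split(
        text,
        delim_regex_pattern=split_pattern,
        keep_delim_regex_pattern=keep_split_pattern,
    )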
Thanks for the work and investigation on this!
Resolves #307
A few comments: the CJK character ranges were already part of PUNCTUATION_REGEX. Shifted them to a new variable, and added an arg for splitting on CJK (split_on_cjk).
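As a hypothetical usage sketch of the new flag (the vocabulary and exact constructor arguments here are illustrative, not taken from this PR):

import keras_nlp

# With split_on_cjk enabled, each CJK character is split into its own
# token before WordPiece matching; output is integer ids into the vocabulary.
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=["[UNK]", "今", "天", "气", "不", "错"],
    split_on_cjk=True,
)
token_ids = tokenizer("今天天气不错")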