
Add Support for CJK Char Splitting for WordPiece Tokenizer #318

Merged: 10 commits, merged into keras-team:master on Aug 29, 2022

Conversation

abheesht17 (Collaborator) commented on Aug 27, 2022

Resolves #307

A few comments:

  • The CJK regex was already present in PUNCTUATION_REGEX. I moved it to a new variable and added an arg for splitting on CJK characters (split_on_cjk); see the short illustration below.
  • Made similar changes in the WordPiece trainer.
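For context, "splitting on CJK" means treating each CJK character as its own token boundary, the way BERT's original tokenizer does. A minimal illustration with tensorflow_text (the pattern below covers only the basic CJK Unified Ideographs block, and the example strings are illustrative, not from this PR):

import tensorflow as tf
import tensorflow_text as tf_text

# Illustration only: just the basic CJK Unified Ideographs block.
CJK_REGEX = r"[\x{4E00}-\x{9FFF}]"

text = tf.constant(["machine learning 机器学习"])
# Splitting on whitespace and CJK characters, while keeping the CJK
# delimiters, turns each ideograph into a standalone piece.
pieces = tf_text.regex_split(
    text,
    delim_regex_pattern=r"\s|" + CJK_REGEX,
    keep_delim_regex_pattern=CJK_REGEX,
)
print(pieces)  # [["machine", "learning", "机", "器", "学", "习"]]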

mattdangerw (Member) left a comment

lgtm! Just a few comments

@@ -63,8 +54,24 @@
]
)

# Matches CJK characters.
mattdangerw (Member):

Let's add a link to the place we got this from in the BERT repo. It's a good reference to track.
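For reference, BERT's _is_chinese_char check (google-research/bert, tokenization.py) covers the following Unicode blocks; this transcription is from memory, so verify it against the source:

# CJK Unicode block ranges from BERT's _is_chinese_char
# (google-research/bert, tokenization.py), written as an RE2
# character class. Transcribed from memory; verify before relying on it.
CJK_REGEX = (
    r"[\x{4E00}-\x{9FFF}"    # CJK Unified Ideographs
    r"\x{3400}-\x{4DBF}"     # Extension A
    r"\x{20000}-\x{2A6DF}"   # Extension B
    r"\x{2A700}-\x{2B73F}"   # Extension C
    r"\x{2B740}-\x{2B81F}"   # Extension D
    r"\x{2B820}-\x{2CEAF}"   # Extension E
    r"\x{F900}-\x{FAFF}"     # CJK Compatibility Ideographs
    r"\x{2F800}-\x{2FA1F}]"  # Compatibility Ideographs Supplement
)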


-def pretokenize(text, lowercase, strip_accents, split):
+def pretokenize(
+    text, lowercase=True, strip_accents=True, split=True, split_on_cjk=True
mattdangerw (Member):

nit: add a trailing comma so this formats to multiple lines

@@ -91,6 +102,8 @@ def pretokenize(text, lowercase, strip_accents, split):
# Preprocess, lowercase, strip and split input data.
if text.shape.rank == 0:
    text = tf.expand_dims(text, 0)
if split_on_cjk and split:
mattdangerw (Member):

Wow interesting, it looks like we were already splitting on these. I think in that case we could actually avoid this regex_replace call and just keep two different split regexes.

WHITESPACE_AND_PUNCTUATION_REGEX and WHITESPACE_PUNCTUATION_AND_CJK_REGEX

if split:
    if split_on_cjk:
        split_pattern = WHITESPACE_PUNCTUATION_AND_CJK_REGEX
    else:
        split_pattern = WHITESPACE_AND_PUNCTUATION_REGEX
    text = tf_text.regex_split(...)

Would this work?
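For reference, tensorflow_text's regex_split also takes a keep_delim_regex_pattern for delimiters that should survive as tokens, which is presumably what the elided arguments above would need to fill in:

# Hedged completion of the sketch above; which pattern to pass as
# keep_delim_regex_pattern is exactly the question raised in the reply.
text = tf_text.regex_split(
    text,
    delim_regex_pattern=split_pattern,
    keep_delim_regex_pattern=PUNCTUATION_REGEX,  # likely needs CJK too
)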

abheesht17 (Collaborator, Author):

Hmmm, yeah. But we'd have to create a third regex, namely PUNCTUATION_AND_CJK_REGEX, to pass to keep_delim_regex_pattern. Isn't it easier to just keep tf.strings.regex_replace(), then? What do you think?
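Concretely, the third regex described here might look like the following (a sketch, not the code as merged; keep_pattern is an illustrative local name):

# Sketch: punctuation and CJK characters should both be kept as
# standalone tokens after splitting, so the "keep" pattern needs
# to cover both when split_on_cjk is enabled.
PUNCTUATION_AND_CJK_REGEX = PUNCTUATION_REGEX + r"|" + CJK_REGEX

if split:
    if split_on_cjk:
        split_pattern = WHITESPACE_PUNCTUATION_AND_CJK_REGEX
        keep_pattern = PUNCTUATION_AND_CJK_REGEX
    else:
        split_pattern = WHITESPACE_AND_PUNCTUATION_REGEX
        keep_pattern = PUNCTUATION_REGEX
    text = tf_text.regex_split(
        text,
        delim_regex_pattern=split_pattern,
        keep_delim_regex_pattern=keep_pattern,
    )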

abheesht17 (Collaborator, Author):

Pushing the changes for now; let me know if it's better to revert to the original.

mattdangerw (Member):

This looks fine to me. Original is simpler, but we probably should care about efficiency here (and cutting an op should help that).

abheesht17 (Collaborator, Author):

Cool!

@@ -99,10 +133,16 @@ def pretokenize(text, lowercase, strip_accents, split):
# Remove the accent marks.
text = tf.strings.regex_replace(text, r"\p{Mn}", "")
if split:
    if split_on_cjk:
        split_pattern = WHITESPACE_PUNCTUATION_AND_CJK_REGEX
mattdangerw (Member):

delim_regex_pattern, I guess? Since you're agreeing with the other arg name.

abheesht17 (Collaborator, Author):

Hehe, I just changed it to keep_split_pattern :P

mattdangerw (Member):

that works too!


mattdangerw (Member) left a comment

Thanks for the work and investigation on this!

@mattdangerw mattdangerw merged commit 082ccaa into keras-team:master Aug 29, 2022
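A minimal usage sketch of the merged option (the toy vocabulary is illustrative; real usage would load a trained WordPiece vocabulary):

import keras_nlp

# Toy vocabulary for illustration; "##" marks WordPiece suffix pieces.
vocab = ["[UNK]", "the", "qu", "##ick", "fox", "机", "器"]

tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    lowercase=True,
    split=True,
    split_on_cjk=True,  # the flag added in this PR
)
print(tokenizer(["The quick fox 机器"]))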
Successfully merging this pull request may close these issues: Support CJK character splitting for WordPiece tokenizer (#307).