
Add Support for CJK Char Splitting for WordPiece Tokenizer #318

Merged: 10 commits, merged into keras-team:master on Aug 29, 2022

Conversation

abheesht17 (Collaborator) commented on Aug 27, 2022

Resolves #307

A few comments:

  • The CJK regex was already present in PUNCTUATION_REGEX. I moved it to a new variable and added an arg for splitting on CJK characters (split_on_cjk); see the short illustration below.
  • Made similar changes in the WordPiece trainer.
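For context, "splitting on CJK" means treating each CJK character as its own token boundary, the way BERT's original tokenizer does. A minimal illustration with tensorflow_text (the pattern below covers only the basic CJK Unified Ideographs block, and the example strings are illustrative, not from this PR):

import tensorflow as tf
import tensorflow_text as tf_text

# Illustration only: just the basic CJK Unified Ideographs block.
CJK_REGEX = r"[\x{4E00}-\x{9FFF}]"

text = tf.constant(["machine learning 机器学习"])
# Splitting on whitespace and CJK characters, while keeping the CJK
# delimiters, turns each ideograph into a standalone piece.
pieces = tf_text.regex_split(
    text,
    delim_regex_pattern=r"\s|" + CJK_REGEX,
    keep_delim_regex_pattern=CJK_REGEX,
)
print(pieces)  # [["machine", "learning", "机", "器", "学", "习"]]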

mattdangerw (Member) left a comment

lgtm! Just a few comments

@@ -63,8 +54,24 @@
]
)

# Matches CJK characters.
mattdangerw (Member):

Let's add a link to the place we got this from in the BERT repo. It's a good reference to track.
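For reference, BERT's _is_chinese_char check (google-research/bert, tokenization.py) covers the following Unicode blocks; this transcription is from memory, so verify it against the source:

# CJK Unicode block ranges from BERT's _is_chinese_char
# (google-research/bert, tokenization.py), written as an RE2
# character class. Transcribed from memory; verify before relying on it.
CJK_REGEX = (
    r"[\x{4E00}-\x{9FFF}"    # CJK Unified Ideographs
    r"\x{3400}-\x{4DBF}"     # Extension A
    r"\x{20000}-\x{2A6DF}"   # Extension B
    r"\x{2A700}-\x{2B73F}"   # Extension C
    r"\x{2B740}-\x{2B81F}"   # Extension D
    r"\x{2B820}-\x{2CEAF}"   # Extension E
    r"\x{F900}-\x{FAFF}"     # CJK Compatibility Ideographs
    r"\x{2F800}-\x{2FA1F}]"  # Compatibility Ideographs Supplement
)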


-def pretokenize(text, lowercase, strip_accents, split):
+def pretokenize(
+    text, lowercase=True, strip_accents=True, split=True, split_on_cjk=True
mattdangerw (Member):

nit: add a trailing comma so this formats to multiple lines

@@ -91,6 +102,8 @@ def pretokenize(text, lowercase, strip_accents, split):
# Preprocess, lowercase, strip and split input data.
if text.shape.rank == 0:
    text = tf.expand_dims(text, 0)
if split_on_cjk and split:
mattdangerw (Member):

Wow interesting, it looks like we were already splitting on these. I think in that case we could actually avoid this regex_replace call and just keep two different split regexes.

WHITESPACE_AND_PUNCTUATION_REGEX and WHITESPACE_PUNCTUATION_AND_CJK_REGEX

if split:
    if split_on_cjk:
        split_pattern = WHITESPACE_PUNCTUATION_AND_CJK_REGEX
    else:
        split_pattern = WHITESPACE_AND_PUNCTUATION_REGEX
    text = tf_text.regex_split(...)

Would this work?
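For reference, tensorflow_text's regex_split also takes a keep_delim_regex_pattern for delimiters that should survive as tokens, which is presumably what the elided arguments above would need to fill in:

# Hedged completion of the sketch above; which pattern to pass as
# keep_delim_regex_pattern is exactly the question raised in the reply.
text = tf_text.regex_split(
    text,
    delim_regex_pattern=split_pattern,
    keep_delim_regex_pattern=PUNCTUATION_REGEX,  # likely needs CJK too
)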

abheesht17 (Collaborator, Author):

Hmmm, yeah. But we'd have to create a third regex, namely PUNCTUATION_AND_CJK_REGEX, to pass to keep_delim_regex_pattern. Isn't it easier to just keep tf.strings.regex_replace(), then? What do you think?
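Concretely, the third regex described here might look like the following (a sketch, not the code as merged; keep_pattern is an illustrative local name):

# Sketch: punctuation and CJK characters should both be kept as
# standalone tokens after splitting, so the "keep" pattern needs
# to cover both when split_on_cjk is enabled.
PUNCTUATION_AND_CJK_REGEX = PUNCTUATION_REGEX + r"|" + CJK_REGEX

if split:
    if split_on_cjk:
        split_pattern = WHITESPACE_PUNCTUATION_AND_CJK_REGEX
        keep_pattern = PUNCTUATION_AND_CJK_REGEX
    else:
        split_pattern = WHITESPACE_AND_PUNCTUATION_REGEX
        keep_pattern = PUNCTUATION_REGEX
    text = tf_text.regex_split(
        text,
        delim_regex_pattern=split_pattern,
        keep_delim_regex_pattern=keep_pattern,
    )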

abheesht17 (Collaborator, Author):

Pushing the changes for now; let me know if it's better to revert to the original.

mattdangerw (Member):

This looks fine to me. Original is simpler, but we probably should care about efficiency here (and cutting an op should help that).

abheesht17 (Collaborator, Author):

Cool!

@@ -99,10 +133,16 @@ def pretokenize(text, lowercase, strip_accents, split):
# Remove the accent marks.
text = tf.strings.regex_replace(text, r"\p{Mn}", "")
if split:
    if split_on_cjk:
        split_pattern = WHITESPACE_PUNCTUATION_AND_CJK_REGEX
mattdangerw (Member):

delim_regex_pattern, I guess? Since you're agreeing with the other arg name.

abheesht17 (Collaborator, Author):

Hehe, I just changed it to keep_split_pattern :P

mattdangerw (Member):

that works too!


mattdangerw (Member) left a comment

Thanks for the work and investigation on this!

@mattdangerw mattdangerw merged commit 082ccaa into keras-team:master Aug 29, 2022
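A minimal usage sketch of the merged option (the toy vocabulary is illustrative; real usage would load a trained WordPiece vocabulary):

import keras_nlp

# Toy vocabulary for illustration; "##" marks WordPiece suffix pieces.
vocab = ["[UNK]", "the", "qu", "##ick", "fox", "机", "器"]

tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    lowercase=True,
    split=True,
    split_on_cjk=True,  # the flag added in this PR
)
print(tokenizer(["The quick fox 机器"]))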
Successfully merging this pull request may close these issues: Support CJK character splitting for WordPiece tokenizer (#307).