
Support tokenization of special tokens for word_piece_tokenizer #1397

Merged · 19 commits merged into keras-team:master on Mar 20, 2024

Conversation

@abuelnasr0 (Contributor)

This PR enables the user to tokenize special tokens in text input, as suggested in #1395.
The behaviour matches BytePairTokenizer, I think; the models' tokenizers that use WordPieceTokenizer now behave the same as the models' tokenizers that use BytePairTokenizer.
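For readers skimming the thread, here is a minimal sketch of the behaviour this PR targets. The tiny vocabulary is made up, and the special_tokens argument name reflects this PR's initial form rather than necessarily the final merged API:

```python
import keras_nlp

vocab = ["[UNK]", "[CLS]", "[SEP]", "the", "quick", "brown", "fox"]

# Without special-token handling, "[SEP]" is split on punctuation and its
# pieces fall back to "[UNK]".
plain = keras_nlp.tokenizers.WordPieceTokenizer(vocabulary=vocab)
print(plain(["the quick fox [SEP]"]))

# With the argument added in this PR (name assumed), "[SEP]" is kept whole
# and mapped to its own vocabulary id.
special = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    special_tokens=["[CLS]", "[SEP]"],
)
print(special(["the quick fox [SEP]"]))
```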

@mattdangerw (Member)

Thanks! This is a great feature.

I will think more on this next week. I like the overall approach of doing this on top of tf-text instead of inside the tf-text ops. But we may want to think about this more generally. Something like this as an end state...

  • All tokenizers have a special_tokens_in_strings=False argument. If True, input strings can contain special tokens, which will be handled correctly.
  • For pretrained tokenizers, this is set up automatically to respect the model's special tokens.
  • For a base tokenizer, a special token list can be provided by the user.
  • We eventually do this for all subword tokenizers: sentencepiece, byte pair, and word piece.

Main question here for me... Can we guarantee for all tokenizer types that this will work without messing with the "black box" tf-text op? How?

@abuelnasr0 let me know what you think! Once we have an overall plan, landing this incrementally (e.g., starting with word piece as you are doing here) sounds good!
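To make the "on top of tf-text" layering concrete: the idea is to split the raw text on the special tokens before it ever reaches the opaque subword op, map each special token straight to its reserved id, and tokenize only the ordinary spans. The actual implementation works with tf-text and ragged tensors; the snippet below is a plain-Python sketch of the idea only, with made-up helper names (tokenize_with_special_tokens, toy_tokenize) for illustration:

```python
import re

def tokenize_with_special_tokens(text, tokenize, special_tokens, vocab):
    # Capturing group so re.split keeps the special tokens themselves.
    pattern = "|".join(re.escape(t) for t in special_tokens)
    ids = []
    for piece in re.split(f"({pattern})", text):
        if piece in special_tokens:
            ids.append(vocab[piece])       # special token -> its reserved id
        elif piece.strip():
            ids.extend(tokenize(piece))    # "black box" subword op, untouched
    return ids

# Toy usage with a stand-in "black box" tokenizer (not tf-text).
toy_vocab = {"[SEP]": 102}
toy_tokenize = lambda s: [len(w) for w in s.split()]
print(tokenize_with_special_tokens("hello world [SEP] bye",
                                   toy_tokenize, ["[SEP]"], toy_vocab))
```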

@abuelnasr0 (Contributor, Author)

@mattdangerw

All tokenizers have a special_tokens_in_strings=False argument. If True, input strings can contain special tokens, which will be handled correctly.

Do you think special_tokens_in_strings would be better as a call argument of the tokenize method or as an initialization argument?

Main question here for me... Can we guarantee for all tokenizer types that this will work without messing with the "black box" tf-text op? How?

For byte pair and WordPiece (with this PR), I think that is guaranteed.
For SentencePiece, I am currently writing a Gist that will clarify how I will add the feature to SentencePiece; it guarantees that tf-text will not be touched.
I will post the link here after I finish it, maybe tomorrow.

@abuelnasr0 (Contributor, Author)

abuelnasr0 commented Feb 20, 2024

@mattdangerw check out these two PRs: #1445 (support for SentencePiece) and #1447 (which fixes a small nit in the tokenizer, but we can use the same PR to make the behaviour similar for all).

For SentencePiece, I named the argument special_tokens instead of unsplittable_tokens because there is really no splitting before tokenization in SentencePiece. I also named it special_tokens for WordPiece (this PR), but it could be named unsplittable_tokens, as there is splitting before tokenization. For the BytePair tokenizer it must be unsplittable_tokens, because special_tokens would conflict with WhisperTokenizer, which already uses the special_tokens keyword.
The point is: can the argument be named

  • special_tokens for SentencePiece
  • special_tokens or unsplittable_tokens for WordPiece
  • unsplittable_tokens for BytePair

Or should we just name it unsplittable_tokens for SentencePiece, even if that naming is not precise?

All tokenizers have a special_tokens_in_strings=False argument. If True, input strings can contain special tokens, which will be handled correctly.

Regarding the special_tokens_in_strings you suggested: I think it adds little as a construction argument, because we already pass a special_tokens list, and if a user passes a special_tokens list, that means there are special tokens in the strings that need to be handled.
But if you mean a call argument, then it would be useful: it would give the user the ability to tokenize inputs without caring about special tokens, even if a special_tokens list was passed during construction.

@mattdangerw (Member) left a comment


OK, dropping some comments here for the whole set of PRs.

Re special_tokens or unsplittable_tokens: I think what we are missing here is configurability. The goal is to be able to turn this behavior on or off not at the base tokenizer class but at the model-level implementations. Basically, we want this...

tokenizer = BertTokenizer.from_preset("bert_base_en")
tokenizer(["Bert uses [SEP] to mark sequence boundaries"])
# Tokenize as the tokens "[", "SE", "#P", "]".
tokenizer = BertTokenizer.from_preset("bert_base_en", special_tokens_in_strings=True)
tokenizer(["Bert uses [SEP] to mark sequence boundaries"])
# Tokenize as the tokens "[SEP]", token 102.

We want to give users a knob that works in a cross-model way to either disallow special tokens in strings or allow them (without even needing to know all the special tokens themselves).

If you are building a production system, you usually want this setting off. You don't want random strings that happen to contain special tokens to mess with the number of segments you have.

If you are writing a guide or building a prototype, you will often want to allow this. Being able to add special tokens in string space is very readable.

Does that make sense?

Review comment on keras_nlp/models/bert/bert_tokenizer.py (outdated, resolved)
@abuelnasr0 (Contributor, Author)

abuelnasr0 commented Mar 8, 2024

Does that make sense?

@mattdangerw Indeed! Thanks for that; it makes sense, and it's a point of view I was not aware of. I will address it for WordPieceTokenizer and BytePairTokenizer.

@abuelnasr0 (Contributor, Author)

@mattdangerw Check out the latest changes. I have added a special_tokens_in_strings argument for the base tokenizer and the model tokenizers. If all is good, I will go ahead and do the same for BytePairTokenizer.
The default behaviour of BytePairTokenizer will change: it currently handles special tokens correctly by default, but I will change it so that special tokens are only tokenized correctly when special_tokens_in_strings is True.
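A hedged sketch of the opt-in knob being described here, which supersedes the earlier sketch in this thread: the argument names special_tokens and special_tokens_in_strings are assumed from the discussion, and the vocabulary is made up for illustration.

```python
import keras_nlp

vocab = ["[UNK]", "[CLS]", "[SEP]", "bert", "uses", "to", "mark"]

# Default: special tokens in the input string get no special treatment, so
# "[SEP]" is split and largely maps to "[UNK]".
default_tok = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    special_tokens=["[CLS]", "[SEP]"],
)

# Opt in: "[SEP]" in the raw string is kept intact and mapped to its own id.
opt_in_tok = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    special_tokens=["[CLS]", "[SEP]"],
    special_tokens_in_strings=True,
)

print(default_tok(["bert uses [SEP] to mark"]))
print(opt_in_tok(["bert uses [SEP] to mark"]))
```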

@mattdangerw (Member)

Thanks! Will take a look shortly!

The default behaviour of BytePairTokenizer will change: it currently handles special tokens correctly by default, but I will change it so that special tokens are only tokenized correctly when special_tokens_in_strings is True.

Yeah, this is OK I think. We would just want to clearly call it out as a breaking change in our next release. Thankfully the fix is quite clear: anyone who wants the old behavior can pass the option.

@mattdangerw (Member) left a comment


Nice! This looks good to me. Just a couple comments.

Review comments (outdated, resolved) on:
  • keras_nlp/tokenizers/word_piece_tokenizer.py (3 comments)
  • keras_nlp/models/distil_bert/distil_bert_tokenizer.py
@mattdangerw added the kokoro:force-run (Runs Tests on GPU) label on Mar 20, 2024
@mattdangerw (Member) left a comment


LGTM! Will pull in when CI is done.

@kokoro-team removed the kokoro:force-run (Runs Tests on GPU) label on Mar 20, 2024
@mattdangerw merged commit 45d8bd3 into keras-team:master on Mar 20, 2024
10 checks passed
abuelnasr0 added a commit to abuelnasr0/keras-nlp that referenced this pull request Apr 2, 2024
…s-team#1397)

* Support tokenization of special tokens for word_piece_tokenizer

* Add the feature to models tokenizers

* Format the code

* Fix Fromat

* Small fixes

* Add tests for bert

* Add tests for distilbert

* Small fix for bert test

* Add tests for electra

* Fix code format

* Rename unsplittable to special

* Edit special_tokens Arg

* Format the code

* Move special tokens checking into base class

* Add special_tokens_in_strings Arg

* Shorten comments

* Shorten comments

* Shorten the logic og splitting and add comments

* Code format
@abuelnasr0 deleted the WP_tokenizer_special_tokens branch on April 2, 2024 at 22:37