
Support tokenization of special tokens for word_piece_tokenizer #1397

Merged · 19 commits merged into keras-team:master on Mar 20, 2024

Conversation

@abuelnasr0 (Contributor)

This PR enables the user to tokenize special tokens in text input, as suggested in #1395.
The behaviour matches BytePairTokenizer, I think; the models' tokenizers that use WordPieceTokenizer now behave the same as the models' tokenizers that use BytePairTokenizer.
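For readers skimming the thread, here is a minimal sketch of the behaviour this PR targets. The tiny vocabulary is made up, and the special_tokens argument name reflects this PR's initial form rather than necessarily the final merged API:

```python
import keras_nlp

vocab = ["[UNK]", "[CLS]", "[SEP]", "the", "quick", "brown", "fox"]

# Without special-token handling, "[SEP]" is split on punctuation and its
# pieces fall back to "[UNK]".
plain = keras_nlp.tokenizers.WordPieceTokenizer(vocabulary=vocab)
print(plain(["the quick fox [SEP]"]))

# With the argument added in this PR (name assumed), "[SEP]" is kept whole
# and mapped to its own vocabulary id.
special = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    special_tokens=["[CLS]", "[SEP]"],
)
print(special(["the quick fox [SEP]"]))
```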

@mattdangerw (Member)

Thanks! This is a great feature.

I will think more on this next week. I like the overall approach of doing this on top of tf-text instead of inside the tf-text ops. But we may want to think about this more generally. Something like this as an end state...

  • All tokenizers have a special_tokens_in_strings=False argument. If True, input strings can contain special tokens, which will be handled correctly.
  • For pretrained tokenizers, this is set up automatically to respect the model's special tokens.
  • For a base tokenizer, a special token list can be provided by the user.
  • We eventually do this for all subword tokenizers: sentencepiece, byte pair, and word piece.

Main question here for me... Can we guarantee for all tokenizer types that this will work without messing with the "black box" tf-text op? How?

@abuelnasr0 let me know what you think! Once we have an overall plan, landing this incrementally (e.g., starting with word piece as you are doing here) sounds good!
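To make the "on top of tf-text" layering concrete: the idea is to split the raw text on the special tokens before it ever reaches the opaque subword op, map each special token straight to its reserved id, and tokenize only the ordinary spans. The actual implementation works with tf-text and ragged tensors; the snippet below is a plain-Python sketch of the idea only, with made-up helper names (tokenize_with_special_tokens, toy_tokenize) for illustration:

```python
import re

def tokenize_with_special_tokens(text, tokenize, special_tokens, vocab):
    # Capturing group so re.split keeps the special tokens themselves.
    pattern = "|".join(re.escape(t) for t in special_tokens)
    ids = []
    for piece in re.split(f"({pattern})", text):
        if piece in special_tokens:
            ids.append(vocab[piece])       # special token -> its reserved id
        elif piece.strip():
            ids.extend(tokenize(piece))    # "black box" subword op, untouched
    return ids

# Toy usage with a stand-in "black box" tokenizer (not tf-text).
toy_vocab = {"[SEP]": 102}
toy_tokenize = lambda s: [len(w) for w in s.split()]
print(tokenize_with_special_tokens("hello world [SEP] bye",
                                   toy_tokenize, ["[SEP]"], toy_vocab))
```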

@abuelnasr0 (Contributor, Author)

@mattdangerw

All tokenizers have a special_tokens_in_strings=False argument. If True, input strings can contain special tokens, which will be handled correctly.

Do you think special_tokens_in_strings would be better as a call argument of the tokenize method or as an initialization argument?

Main question here for me... Can we guarantee for all tokenizer types that this will work without messing with the "black box" tf-text op? How?

For byte pair and WordPiece (with this PR), I think that is guaranteed.
For SentencePiece, I am currently writing a Gist that will clarify how I will add the feature to SentencePiece; it guarantees that tf-text will not be touched.
I will post the link here after I finish it, maybe tomorrow.

@abuelnasr0 (Contributor, Author)

abuelnasr0 commented Feb 20, 2024

@mattdangerw check out these two PRs: #1445 (support for SentencePiece) and #1447 (which fixes a small nit in the tokenizer, but we can use the same PR to make the behaviour similar for all).

For SentencePiece, I named the argument special_tokens instead of unsplittable_tokens because there is really no splitting before tokenization in SentencePiece. I also named it special_tokens for WordPiece (this PR), but it could be named unsplittable_tokens, as there is splitting before tokenization. For the BytePair tokenizer it must be unsplittable_tokens, because special_tokens would conflict with WhisperTokenizer, which already uses the special_tokens keyword.
The point is: can the argument be named

  • special_tokens for SentencePiece
  • special_tokens or unsplittable_tokens for WordPiece
  • unsplittable_tokens for BytePair

Or should we just name it unsplittable_tokens for SentencePiece, even if that naming is not precise?

All tokenizers have a special_tokens_in_strings=False argument. If True, input strings can contain special tokens, which will be handled correctly.

Regarding the special_tokens_in_strings you suggested: I think it adds little as a construction argument, because we already pass a special_tokens list, and if a user passes a special_tokens list, that means there are special tokens in the strings that need to be handled.
But if you mean a call argument, then it would be useful: it would give the user the ability to tokenize inputs without caring about special tokens, even if a special_tokens list was passed during construction.

@mattdangerw (Member) left a comment


OK, dropping some comments here for the whole set of PRs.

Re special_tokens or unsplittable_tokens: I think what we are missing here is configurability. The goal is to be able to turn this behavior on or off not at the base tokenizer class but at the model-level implementations. Basically, we want this...

tokenizer = BertTokenizer.from_preset("bert_base_en")
tokenizer(["Bert uses [SEP] to mark sequence boundaries"])
# Tokenize as the tokens "[", "SE", "#P", "]".
tokenizer = BertTokenizer.from_preset("bert_base_en", special_tokens_in_strings=True)
tokenizer(["Bert uses [SEP] to mark sequence boundaries"])
# Tokenize as the tokens "[SEP]", token 102.

We want to give users a knob that works in a cross-model way to either disallow special tokens in strings or allow them (without even needing to know all the special tokens themselves).

If you are building a production system, you usually want this setting off. You don't want random strings that happen to contain special tokens to mess with the number of segments you have.

If you are writing a guide or building a prototype, you will often want to allow this. Being able to add special tokens in string space is very readable.

Does that make sense?

Review comment on keras_nlp/models/bert/bert_tokenizer.py (outdated, resolved)
@abuelnasr0 (Contributor, Author)

abuelnasr0 commented Mar 8, 2024

Does that make sense?

@mattdangerw Indeed! Thanks for that; it makes sense, and it's a point of view I was not aware of. I will address it for WordPieceTokenizer and BytePairTokenizer.

@abuelnasr0 (Contributor, Author)

@mattdangerw Check out the latest changes. I have added a special_tokens_in_strings argument for the base tokenizer and the model tokenizers. If all is good, I will go ahead and do the same for BytePairTokenizer.
The default behaviour of BytePairTokenizer will change: it currently handles special tokens correctly by default, but I will change it so that special tokens are only tokenized correctly when special_tokens_in_strings is True.
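A hedged sketch of the opt-in knob being described here, which supersedes the earlier sketch in this thread: the argument names special_tokens and special_tokens_in_strings are assumed from the discussion, and the vocabulary is made up for illustration.

```python
import keras_nlp

vocab = ["[UNK]", "[CLS]", "[SEP]", "bert", "uses", "to", "mark"]

# Default: special tokens in the input string get no special treatment, so
# "[SEP]" is split and largely maps to "[UNK]".
default_tok = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    special_tokens=["[CLS]", "[SEP]"],
)

# Opt in: "[SEP]" in the raw string is kept intact and mapped to its own id.
opt_in_tok = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    special_tokens=["[CLS]", "[SEP]"],
    special_tokens_in_strings=True,
)

print(default_tok(["bert uses [SEP] to mark"]))
print(opt_in_tok(["bert uses [SEP] to mark"]))
```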

@mattdangerw (Member)

Thanks! Will take a look shortly!

The default behaviour of BytePairTokenizer will change: it currently handles special tokens correctly by default, but I will change it so that special tokens are only tokenized correctly when special_tokens_in_strings is True.

Yeah, this is OK I think. We would just want to clearly call it out as a breaking change in our next release. Thankfully the fix is quite clear: anyone who wants the old behavior can pass the option.

@mattdangerw (Member) left a comment


Nice! This looks good to me. Just a couple comments.

Review comments (outdated, resolved) on:
  • keras_nlp/tokenizers/word_piece_tokenizer.py (3 comments)
  • keras_nlp/models/distil_bert/distil_bert_tokenizer.py
@mattdangerw added the kokoro:force-run (Runs Tests on GPU) label on Mar 20, 2024
@mattdangerw (Member) left a comment


LGTM! Will pull in when CI is done.

@kokoro-team removed the kokoro:force-run (Runs Tests on GPU) label on Mar 20, 2024
@mattdangerw merged commit 45d8bd3 into keras-team:master on Mar 20, 2024
10 checks passed
abuelnasr0 added a commit to abuelnasr0/keras-nlp that referenced this pull request Apr 2, 2024
…s-team#1397)

* Support tokenization of special tokens for word_piece_tokenizer

* Add the feature to models tokenizers

* Format the code

* Fix Fromat

* Small fixes

* Add tests for bert

* Add tests for distilbert

* Small fix for bert test

* Add tests for electra

* Fix code format

* Rename unsplittable to special

* Edit special_tokens Arg

* Format the code

* Move special tokens checking into base class

* Add special_tokens_in_strings Arg

* Shorten comments

* Shorten comments

* Shorten the logic og splitting and add comments

* Code format
@abuelnasr0 deleted the WP_tokenizer_special_tokens branch on April 2, 2024 at 22:37