Small fixes for special_tokens arg in BPE #969
Conversation
sequence_length=None,
add_prefix_space=False,
special_tokens=None,
special_tokens_lst=None,
I think a rename would make sense, but maybe we just name this `unsplittable_tokens` or something like that. That would be more literal as to what this is doing. Docstring...

unsplittable_tokens: A list of strings that will never be split during the word-level splitting applied before the byte-pair encoding. This can be used to ensure string special tokens map to a unique index in the vocabulary, even if these special tokens contain splittable characters such as punctuation.
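To illustrate the behavior the docstring describes, here is a minimal sketch of a word-level pre-tokenizer that protects unsplittable tokens (the function name and regex are hypothetical, not the actual BytePairTokenizer implementation):

```python
import re

def pre_tokenize(text, unsplittable_tokens=None):
    """Split text into words/punctuation, keeping unsplittable tokens whole.

    A toy sketch: without protection, the default split would break a
    special token like "</s>" into its punctuation characters, so it
    could never map to a single vocabulary index.
    """
    unsplittable_tokens = unsplittable_tokens or []
    if unsplittable_tokens:
        # Try to match any protected token first, then fall back to
        # plain words and single punctuation characters.
        protected = "|".join(re.escape(t) for t in unsplittable_tokens)
        pattern = f"({protected})|(\\w+)|([^\\w\\s])"
    else:
        pattern = r"(\w+)|([^\w\s])"
    return [m.group(0) for m in re.finditer(pattern, text)]

print(pre_tokenize("hello </s>"))
# ['hello', '<', '/', 's', '>']
print(pre_tokenize("hello </s>", unsplittable_tokens=["</s>"]))
# ['hello', '</s>']
```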
/gcbrun
pad_token = "<pad>"
end_token = "</s>"

if "unsplittable_tokens" in kwargs:
I agree that this is probably better UX, but just in terms of our own code maintenance, I think we can ditch this. If users pass this argument they will get an error message about duplicate kwargs, and can check the code here. It is fairly self-explanatory.
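The built-in error referred to here is Python's own duplicate keyword argument check. A minimal sketch (the class and argument wiring are hypothetical, not the real BartTokenizer code) of what a user would see without the explicit `kwargs` check:

```python
class Base:
    def __init__(self, unsplittable_tokens=None):
        self.unsplittable_tokens = unsplittable_tokens

class Sub(Base):
    def __init__(self, **kwargs):
        # The subclass sets `unsplittable_tokens` internally; a user
        # passing it too triggers Python's duplicate-kwarg TypeError.
        super().__init__(unsplittable_tokens=["<pad>", "</s>"], **kwargs)

try:
    Sub(unsplittable_tokens=["<mask>"])
except TypeError as err:
    print(err)
# ... got multiple values for keyword argument 'unsplittable_tokens'
```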
/gcbrun
LGTM!
Thanks! 2 minor comments.
# In the constructor, we pass the list of special tokens to the
# `unsplittable_tokens` arg of the superclass' constructor. Hence, we
# delete it from the config here.
del config["unsplittable_tokens"]
If we don't remove it from the config, will that just get written again at `from_config` time?
If we don't remove it from the config, it will get passed to the `BartTokenizer` constructor. I was advising the `del` as we don't actually want to support the user providing their own unsplittables at the Bart modeling level. `BartTokenizer` takes over `unsplittable_tokens` as an internal option.
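The round-trip being discussed can be sketched as follows (class names and config shape are simplified stand-ins, not the actual library code): the superclass records `unsplittable_tokens` in its config, but the subclass constructor sets it itself, so `from_config` must drop the key before forwarding the config as kwargs.

```python
class BytePairTokenizer:
    def __init__(self, unsplittable_tokens=None):
        self.unsplittable_tokens = unsplittable_tokens

    def get_config(self):
        # The superclass serializes its own __init__ args.
        return {"unsplittable_tokens": self.unsplittable_tokens}

class BartTokenizer(BytePairTokenizer):
    def __init__(self):
        # Fixed internally; not a user-facing option at the Bart level.
        super().__init__(unsplittable_tokens=["<pad>", "</s>"])

    @classmethod
    def from_config(cls, config):
        # Without this del, `unsplittable_tokens` would be forwarded
        # to a constructor that does not accept it.
        del config["unsplittable_tokens"]
        return cls(**config)

restored = BartTokenizer.from_config(BartTokenizer().get_config())
print(restored.unsplittable_tokens)
# ['<pad>', '</s>']
```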
and will tokenize a word with a leading space differently. Adding
a prefix space to the first word will cause it to be tokenized
equivalently to all subsequent words in the sequence.
unsplittable_tokens: list, defaults to None. A list of strings that will
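The `add_prefix_space` behavior described in the docstring above can be shown with a toy sketch (the vocabulary and the "Ġ" space-marker convention follow byte-level BPE practice, but this is an assumed illustration, not the real tokenizer):

```python
# Toy vocabulary: mid-sentence words carry the "Ġ" leading-space marker.
vocab = {"hello": 0, "Ġhello": 1, "Ġworld": 2}

def encode(text, add_prefix_space=False):
    if add_prefix_space:
        text = " " + text
    tokens = []
    for i, word in enumerate(text.split()):
        # Words preceded by a space get the "Ġ" marker, mirroring
        # byte-level BPE's treatment of leading spaces.
        marked = ("Ġ" + word) if (i > 0 or text.startswith(" ")) else word
        tokens.append(vocab[marked])
    return tokens

print(encode("hello world"))                         # [0, 2]
print(encode("hello world", add_prefix_space=True))  # [1, 2]
```

Note how the first word maps to a different id (0 vs 1) unless a prefix space is added, which makes it match the same vocabulary entries as all subsequent words.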
Maybe worth calling out these "tokens" must exist in the vocab.
sg! I can just add that as I merge
GCP testing looks good! The only failure was #960
- `special_tokens_lst` (since subclass `WhisperTokenizer` also has the same `special_tokens` arg). Open to suggestions.
- `special_tokens_lst` (open to a better explanation).
- `__init__` args to `get_config()`.