adding user defined tokens #30824 #30929

itazap · 2024-05-21T08:10:01Z

Tasks

fix converter to handle user_defined_symbols
create necessary flags for user_defined_symbols
update docs
test slow == fast
add test without sentencepiece

Who can review?

@ArthurZucker

HuggingFaceDocBuilderDev · 2024-05-21T08:29:32Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

src/transformers/models/llama/tokenization_llama_fast.py

ArthurZucker · 2024-06-04T15:29:05Z

src/transformers/convert_slow_tokenizer.py

+        control_symbols = [
+            AddedToken(token, normalized=True, special=False) for token in self.proto.trainer_spec.control_symbols
+        ]
+        tokenizer.add_tokens(user_defined_symbols + control_symbols)


I think we should set normalized to False, and more importantly control_symbols should be special no? (I don't remember which ones should be skipped during decoding)
Also adding special and non special: relate to #30574

src/transformers/convert_slow_tokenizer.py

ArthurZucker · 2024-06-04T15:30:48Z

src/transformers/convert_slow_tokenizer.py

+        add_prefix_space = self.proto.normalizer_spec.add_dummy_prefix
+      #  tokenizer.add_prefix_space = add_prefix_space
+
+        if hasattr(self.original_tokenizer,"add_prefix_space") and self.original_tokenizer.add_prefix_space is not None:


Suggested change

if hasattr(self.original_tokenizer,"add_prefix_space") and self.original_tokenizer.add_prefix_space is not None:

if hadd_prefix_space:

should we rather test this? Or do some | on original tokenizer?

ArthurZucker · 2024-06-04T15:31:56Z

src/transformers/models/llama/tokenization_llama.py

here it's missing the part where we want to reset the normalizer.add_dummy_prefix_space when we don save_pretrained + a small test that shows this works as expected

ArthurZucker · 2024-06-04T15:48:27Z

src/transformers/models/llama/tokenization_llama_fast.py

+        #TODO: ita
+        self.add_prefix_space = add_prefix_space
+        # if add_prefix_space is not None:
+        #     kwargs["from_slow"] = True
+
+        if self.force_from_slow() is True:
            kwargs["from_slow"] = True


IMO should be called in PreTrainedTokenizerFast

ArthurZucker · 2024-06-04T15:49:08Z

src/transformers/tokenization_utils_fast.py

+        elif fast_tokenizer_file is not None:  # and not from_slow:
+            # We have a serialization from tokenizers which let us directly build the backend
+            fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)


why are we changing the order of priority here?

I'll check the order but it is to allow converting without sentencepiece installed

ArthurZucker · 2024-06-04T15:49:32Z

src/transformers/tokenization_utils_fast.py

+    def update_normalizer(self):
+        """Updates the underlying normalizer with the current `add_prefix_space` and `legacy` settings."""
+        sequence = []
+        if getattr(self, "legacy", True):
+            if getattr(self, "add_prefix_space", True):
+                sequence += [normalizers.Prepend(prepend="▁")]
+            sequence += [normalizers.Replace(pattern=" ", content="▁")]
+            self._tokenizer.normalizer = normalizers.Sequence(sequence)
+
+        elif not getattr(self, "legacy", True):
+            self._tokenizer.normalizer = normalizers.Sequence(sequence) #TODO:ita2
+
+
+    def update_pre_tokenizer(self):
+        """Updates the underlying pre-tokenizer with the current `add_prefix_space` setting."""
+        sequence = []
+        if getattr(self, "add_prefix_space", None) == None:
+            if getattr(self._tokenizer, "normalizer", None) == None:
+                return
+            curr_normalizer = json.loads(self._tokenizer.normalizer.__getstate__().decode('utf-8'))
+            if 'normalizers' not in curr_normalizer:
+                return
+            prepend_normalizer = [n for n in curr_normalizer['normalizers'] if n['type'] == 'Prepend']
+            if prepend_normalizer:
+                self.add_prefix_space = True
+            else:
+                return
+        elif getattr(self, "add_prefix_space") == False:
+            prepend_scheme = "never"
+
+        if getattr(self, "add_prefix_space", True):
+            prepend_scheme = "always"
+            if not getattr(self, "legacy", True):
+                prepend_scheme = "first"
+        self._tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(replacement="▁", prepend_scheme=prepend_scheme,
+                                                                 split=False)
+        self.update_normalizer()
+
+
+
+    def update_post_processor(self):
+        """
+        Updates the underlying post processor with the current `bos_token` and `eos_token`.
+        """
+        bos = self.bos_token
+        bos_token_id = self.bos_token_id
+        if bos is None and self.add_bos_token:
+            raise ValueError("add_bos_token = True but bos_token = None")
+
+        eos = self.eos_token
+        eos_token_id = self.eos_token_id
+        if eos is None and self.add_eos_token:
+            raise ValueError("add_eos_token = True but eos_token = None")
+
+        single = f"{(bos+':0 ') if self.add_bos_token else ''}$A:0{(' '+eos+':0') if self.add_eos_token else ''}"
+        pair = f"{single}{(' '+bos+':1') if self.add_bos_token else ''} $B:1{(' '+eos+':1') if self.add_eos_token else ''}"
+
+        special_tokens = []
+        if self.add_bos_token:
+            special_tokens.append((bos, bos_token_id))
+        if self.add_eos_token:
+            special_tokens.append((eos, eos_token_id))
+        self._tokenizer.post_processor = processors.TemplateProcessing(
+            single=single, pair=pair, special_tokens=special_tokens
+        )


Nice! It's mostly missing ... testssssss

src/transformers/tokenization_utils_fast.py

ArthurZucker · 2024-06-04T15:53:22Z

src/transformers/tokenization_utils_fast.py

+        if type(self) == PreTrainedTokenizerFast and all(item in kwargs for item in ["add_bos_token", "add_eos_token", "eos_token", "bos_token"]):
+            self.add_bos_token = kwargs.get("add_bos_token")
+            self.add_eos_token = kwargs.get("add_eos_token")
+            self.update_post_processor()


potentially this needs to be called when the eos and bos are updated! 🤗

where are they updated? It looks like only on initialization?

If someone calls add_special_tokens, or we go into eos_token_id's setter and eos_token's setter 😉

…ating

src/transformers/models/t5/tokenization_t5.py

src/transformers/tokenization_utils_fast.py

src/transformers/models/t5/tokenization_t5.py

src/transformers/tokenization_utils_fast.py

itazap linked an issue May 21, 2024 that may be closed by this pull request

SPMConverter does not always add the user defined symbol -> slow fast is thus not equivalent #30824

Closed

itazap force-pushed the 30824-spmconverter-user-defined-symbol branch 4 times, most recently from a979ad9 to 529e2be Compare May 22, 2024 12:00

itazap force-pushed the 30824-spmconverter-user-defined-symbol branch 2 times, most recently from 6bda30c to b0b28f5 Compare May 31, 2024 11:15

Ita Zaporozhets and others added 6 commits May 31, 2024 13:15

adding user defined tokens #30824

24ea0cd

add user defined symbols to all tokenizers from SpmConverter

d1ea757

add comment

31fbe4f

more general approach

7afb159

draft pr

896b7d1

add t5

79ce5bb

itazap force-pushed the 30824-spmconverter-user-defined-symbol branch from b0b28f5 to 79ce5bb Compare May 31, 2024 11:16

itazap added 8 commits May 31, 2024 13:50

t5

fdb63e2

utils update

ff5974b

readd cases

84143a2

draft

c416522

edit

d92822e

fix

74e78f1

moving cases around:

146e8f9

t5 mod

b6de569

itazap commented Jun 3, 2024

View reviewed changes

src/transformers/models/llama/tokenization_llama_fast.py Outdated Show resolved Hide resolved

itazap requested a review from ArthurZucker June 4, 2024 08:38

itazap added 3 commits June 4, 2024 11:00

add eos and bos token arg support

a8694bc

rm change from unaffected file

7f3b798

typo

2751373

ArthurZucker reviewed Jun 4, 2024

View reviewed changes

updates from feedback, consider different pretokenizer types when upd…

16eeb0c

…ating

reverting llama and t5 changes

7e42130

itazap commented Jun 6, 2024

View reviewed changes

src/transformers/models/t5/tokenization_t5.py Outdated Show resolved Hide resolved

itazap commented Jun 6, 2024

View reviewed changes

src/transformers/tokenization_utils_fast.py Outdated Show resolved Hide resolved

itazap added 5 commits June 7, 2024 09:41

SPLITTING PR: moving our user defined symbols issue #30824

874c3f6

changes to _tokenizer updates in fast

5f9b5b1

SPLITTING PR: moving bos/eos token logic to new PR

79507ad

cleaning PR

d28bdde

adding logic for updating normalizer

12940f6

itazap commented Jun 9, 2024

View reviewed changes

src/transformers/models/t5/tokenization_t5.py Show resolved Hide resolved

itazap commented Jun 9, 2024

View reviewed changes

src/transformers/tokenization_utils_fast.py Show resolved Hide resolved

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

adding user defined tokens #30824 #30929

adding user defined tokens #30824 #30929

itazap commented May 21, 2024 •

edited

Loading

HuggingFaceDocBuilderDev commented May 21, 2024

ArthurZucker Jun 4, 2024

ArthurZucker Jun 4, 2024

ArthurZucker Jun 4, 2024

ArthurZucker Jun 4, 2024

ArthurZucker Jun 4, 2024

itazap Jun 5, 2024 •

edited

Loading

ArthurZucker Jun 4, 2024

ArthurZucker Jun 4, 2024

itazap Jun 5, 2024

ArthurZucker Jun 6, 2024

	if hasattr(self.original_tokenizer,"add_prefix_space") and self.original_tokenizer.add_prefix_space is not None:
	if hadd_prefix_space:

adding user defined tokens #30824 #30929

Are you sure you want to change the base?

adding user defined tokens #30824 #30929

Conversation

itazap commented May 21, 2024 • edited Loading

Tasks

Who can review?

HuggingFaceDocBuilderDev commented May 21, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

itazap Jun 5, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

itazap commented May 21, 2024 •

edited

Loading

itazap Jun 5, 2024 •

edited

Loading