
adding user defined tokens #30824 #30929

Draft: itazap wants to merge 24 commits into main

Conversation

itazap (Contributor) commented May 21, 2024

Fixes #30824 #30947

Tasks

  • fix converter to handle user_defined_symbols
  • create necessary flags for user_defined_symbols
  • update docs
  • test slow == fast
  • add test without sentencepiece

Who can review?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

itazap force-pushed the 30824-spmconverter-user-defined-symbol branch 4 times, most recently from a979ad9 to 529e2be, on May 22, 2024 at 12:00
itazap force-pushed the 30824-spmconverter-user-defined-symbol branch 2 times, most recently from 6bda30c to b0b28f5, on May 31, 2024 at 11:15
itazap force-pushed the 30824-spmconverter-user-defined-symbol branch from b0b28f5 to 79ce5bb on May 31, 2024 at 11:16
itazap requested a review from ArthurZucker on June 4, 2024 at 08:38
control_symbols = [
    AddedToken(token, normalized=True, special=False) for token in self.proto.trainer_spec.control_symbols
]
tokenizer.add_tokens(user_defined_symbols + control_symbols)
Collaborator

I think we should set normalized to False, and more importantly control_symbols should be special, no? (I don't remember which ones should be skipped during decoding.)
Also, adding both special and non-special tokens relates to #30574.
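
If I read this right, the converter change being suggested would look roughly like the sketch below, in the same converter context as the snippet above (my reading of the suggestion, not the final diff; which kind of symbol ends up with special=True is exactly the open question):

# Sketch (assumption): neither kind of token is normalized, and control
# symbols are marked special so they can be skipped during decoding.
user_defined_symbols = [
    AddedToken(token, normalized=False, special=False)
    for token in self.proto.trainer_spec.user_defined_symbols
]
control_symbols = [
    AddedToken(token, normalized=False, special=True)
    for token in self.proto.trainer_spec.control_symbols
]
tokenizer.add_tokens(user_defined_symbols + control_symbols)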

src/transformers/convert_slow_tokenizer.py (comment resolved)
src/transformers/convert_slow_tokenizer.py (outdated, resolved)
add_prefix_space = self.proto.normalizer_spec.add_dummy_prefix
# tokenizer.add_prefix_space = add_prefix_space

if hasattr(self.original_tokenizer, "add_prefix_space") and self.original_tokenizer.add_prefix_space is not None:
Collaborator

Suggested change:
- if hasattr(self.original_tokenizer, "add_prefix_space") and self.original_tokenizer.add_prefix_space is not None:
+ if add_prefix_space:

should we rather test this? Or do some | on original tokenizer?
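
One reading of the "do some |" idea, as a sketch (the fallback logic here is my interpretation, not the reviewer's exact intent):

# Sketch: prefer an explicit add_prefix_space set on the original (slow) tokenizer,
# and only fall back to the sentencepiece proto's add_dummy_prefix otherwise.
add_prefix_space = getattr(self.original_tokenizer, "add_prefix_space", None)
if add_prefix_space is None:
    add_prefix_space = self.proto.normalizer_spec.add_dummy_prefix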

Collaborator

Here it's missing the part where we want to reset the normalizer.add_dummy_prefix_space when we do save_pretrained, plus a small test that shows this works as expected.
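
The kind of test being asked for might look roughly like this (illustrative only: the checkpoint name is arbitrary, and the add_prefix_space kwarg is the one this PR is adding):

import tempfile

from transformers import AutoTokenizer


def test_add_prefix_space_survives_save_pretrained():
    # Assumed public sentencepiece-based checkpoint, used only as an example.
    tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b", add_prefix_space=False)
    encoded = tok("Hello")["input_ids"]

    with tempfile.TemporaryDirectory() as tmp:
        tok.save_pretrained(tmp)
        reloaded = AutoTokenizer.from_pretrained(tmp)

    # The reloaded tokenizer must not silently re-enable the dummy prefix.
    assert reloaded("Hello")["input_ids"] == encoded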

Comment on lines 155 to 161
#TODO: ita
self.add_prefix_space = add_prefix_space
# if add_prefix_space is not None:
#     kwargs["from_slow"] = True

if self.force_from_slow() is True:
    kwargs["from_slow"] = True
Collaborator

IMO should be called in PreTrainedTokenizerFast

Comment on lines 118 to 120
elif fast_tokenizer_file is not None:  # and not from_slow:
    # We have a serialization from tokenizers which let us directly build the backend
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Collaborator

why are we changing the order of priority here?

itazap (Contributor, Author) commented Jun 5, 2024

I'll check the order, but it is to allow converting without sentencepiece installed.
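
To make that intent concrete, here is a rough sketch of the loading priority being discussed (this is not the PR's actual diff; the explicit is_sentencepiece_available() guard is my assumption about the goal):

from tokenizers import Tokenizer as TokenizerFast

from transformers.convert_slow_tokenizer import convert_slow_tokenizer
from transformers.utils import is_sentencepiece_available


def load_backend(fast_tokenizer_file=None, slow_tokenizer=None, from_slow=False):
    # Prefer the serialized tokenizer.json unless a slow->fast conversion was
    # explicitly requested; conversion is only possible with sentencepiece installed.
    if fast_tokenizer_file is not None and not from_slow:
        return TokenizerFast.from_file(fast_tokenizer_file)
    if slow_tokenizer is not None and is_sentencepiece_available():
        return convert_slow_tokenizer(slow_tokenizer)
    if fast_tokenizer_file is not None:
        return TokenizerFast.from_file(fast_tokenizer_file)
    raise ValueError("Cannot build the fast tokenizer backend.")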

Comment on lines 882 to 946
def update_normalizer(self):
    """Updates the underlying normalizer with the current `add_prefix_space` and `legacy` settings."""
    sequence = []
    if getattr(self, "legacy", True):
        if getattr(self, "add_prefix_space", True):
            sequence += [normalizers.Prepend(prepend="▁")]
        sequence += [normalizers.Replace(pattern=" ", content="▁")]
        self._tokenizer.normalizer = normalizers.Sequence(sequence)

    elif not getattr(self, "legacy", True):
        self._tokenizer.normalizer = normalizers.Sequence(sequence)  # TODO:ita2


def update_pre_tokenizer(self):
    """Updates the underlying pre-tokenizer with the current `add_prefix_space` setting."""
    sequence = []
    if getattr(self, "add_prefix_space", None) == None:
        if getattr(self._tokenizer, "normalizer", None) == None:
            return
        curr_normalizer = json.loads(self._tokenizer.normalizer.__getstate__().decode('utf-8'))
        if 'normalizers' not in curr_normalizer:
            return
        prepend_normalizer = [n for n in curr_normalizer['normalizers'] if n['type'] == 'Prepend']
        if prepend_normalizer:
            self.add_prefix_space = True
        else:
            return
    elif getattr(self, "add_prefix_space") == False:
        prepend_scheme = "never"

    if getattr(self, "add_prefix_space", True):
        prepend_scheme = "always"
        if not getattr(self, "legacy", True):
            prepend_scheme = "first"
    self._tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
        replacement="▁", prepend_scheme=prepend_scheme, split=False
    )
    self.update_normalizer()


def update_post_processor(self):
    """
    Updates the underlying post processor with the current `bos_token` and `eos_token`.
    """
    bos = self.bos_token
    bos_token_id = self.bos_token_id
    if bos is None and self.add_bos_token:
        raise ValueError("add_bos_token = True but bos_token = None")

    eos = self.eos_token
    eos_token_id = self.eos_token_id
    if eos is None and self.add_eos_token:
        raise ValueError("add_eos_token = True but eos_token = None")

    single = f"{(bos+':0 ') if self.add_bos_token else ''}$A:0{(' '+eos+':0') if self.add_eos_token else ''}"
    pair = f"{single}{(' '+bos+':1') if self.add_bos_token else ''} $B:1{(' '+eos+':1') if self.add_eos_token else ''}"

    special_tokens = []
    if self.add_bos_token:
        special_tokens.append((bos, bos_token_id))
    if self.add_eos_token:
        special_tokens.append((eos, eos_token_id))
    self._tokenizer.post_processor = processors.TemplateProcessing(
        single=single, pair=pair, special_tokens=special_tokens
    )
Collaborator

Nice! It's mostly missing... tests!
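
A rough sketch of what such a test could look like (illustrative: the checkpoint is arbitrary, and it assumes the update_post_processor method from the diff above):

from transformers import AutoTokenizer


def test_update_post_processor_respects_add_bos_eos():
    # Any sentencepiece-based fast tokenizer with bos/eos tokens would do here.
    tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

    tok.add_bos_token, tok.add_eos_token = True, True
    tok.update_post_processor()
    ids = tok("hello")["input_ids"]
    assert ids[0] == tok.bos_token_id and ids[-1] == tok.eos_token_id

    tok.add_bos_token, tok.add_eos_token = False, False
    tok.update_post_processor()
    ids = tok("hello")["input_ids"]
    assert tok.bos_token_id not in ids and tok.eos_token_id not in ids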

src/transformers/tokenization_utils_fast.py (outdated, resolved)
if type(self) == PreTrainedTokenizerFast and all(item in kwargs for item in ["add_bos_token", "add_eos_token", "eos_token", "bos_token"]):
    self.add_bos_token = kwargs.get("add_bos_token")
    self.add_eos_token = kwargs.get("add_eos_token")
    self.update_post_processor()
Collaborator

potentially this needs to be called when the eos and bos are updated! 🤗

itazap (Contributor, Author)

where are they updated? It looks like only on initialization?

Collaborator

If someone calls add_special_tokens, or we go into eos_token_id's setter and eos_token's setter 😉
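
As a concrete sketch of that idea (illustrative only, not the PR's diff): a setter that keeps the post-processor in sync whenever the eos token changes.

class SyncedSpecialTokensMixin:
    """Illustrative: rebuild the template post-processor whenever eos_token is
    reassigned, so encode() immediately reflects the new token."""

    @property
    def eos_token(self):
        return self._eos_token

    @eos_token.setter
    def eos_token(self, value):
        self._eos_token = value
        if getattr(self, "add_eos_token", False):
            self.update_post_processor()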


Successfully merging this pull request may close these issues.

SPMConverter does not always add the user defined symbol -> slow fast is thus not equivalent