
Conversation

chenmoneygithub (Contributor)

There are cases where BPE needs to treat special tokens as a single token, e.g. "<|endoftext|>" in GPT-2.
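
For example, the intended usage looks roughly like the sketch below. This is illustrative only: the argument name `unsplittable_tokens` and the toy vocabulary/merge list are placeholders, not necessarily the final API.

```python
import keras_nlp

# Toy vocabulary and merges, for illustration only.
vocab = {"<|endoftext|>": 0, "a": 1, "Ġquick": 2, "Ġfox": 3}
merges = ["Ġ q", "u i", "c k", "ui ck", "Ġq uick", "Ġ f", "o x", "Ġf ox"]

tokenizer = keras_nlp.tokenizers.BytePairTokenizer(
    vocabulary=vocab,
    merges=merges,
    # Placeholder name for the new argument; see the PR diff for the exact API.
    unsplittable_tokens=["<|endoftext|>"],
)

# "<|endoftext|>" should map to a single id (0 here) instead of being broken
# into "<", "|", "end", ... pieces by the BPE split pattern.
print(tokenizer("a quick fox<|endoftext|>"))
```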

@chenmoneygithub (Contributor, Author)

/gcbrun

@mattdangerw (Member) left a comment

cool! left some initial comments

        )
        self.assertAllEqual(call_output, expected)

    def test_tokenize_with_special_tokens(self):
@mattdangerw (Member):

have you tested detokenization? That is probably worth checking out

@chenmoneygithub (Contributor, Author):

Yeah, detokenization is actually not affected by this PR.

Here the special token has to be present inside the vocab, such as <|endoftext|>, so its id already maps straight back to the original string when detokenizing.
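
A quick round-trip check along those lines (reusing the toy tokenizer sketch from the description, so the argument name and vocab are still just placeholders):

```python
ids = tokenizer("a quick fox<|endoftext|>")
# Because "<|endoftext|>" has its own vocabulary entry, detokenize() maps its
# id back to the literal string along with the other tokens.
print(tokenizer.detokenize(ids))  # roughly b'a quick fox<|endoftext|>'
```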

if special_tokens:

    def build_alt(i):
        return " " + "I" * i + "IGuessIamAValidWord" + "d" * i
@mattdangerw (Member):

heh this feels a little too janky :)

What if we...

  • Move this function out (don't nest it) for readability.
  • Take in the original special_token.
  • Strip all splittable \s\p{L}\p{N} from the original special token.
  • Add a unique prefix or suffix. f"Ĵ{i}" or something like that.

That is nice in that it won't expand the length of these characters much, and if someone is debugging intermediate state, it will look a little more readable. Something roughly like the sketch below.
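
A rough sketch of what I mean (the helper name is just a placeholder, and the third-party regex module stands in here for whatever we use for \p{L}/\p{N} support):

```python
import regex  # third-party `regex` module; supports \p{L}/\p{N}, unlike stdlib `re`


def format_special_token_alternative(special_token, i):
    """Build a stand-in for `special_token` that the BPE split pattern leaves alone."""
    # Drop whitespace, letters, and digits, i.e. everything the splitting
    # pattern acts on, keeping only characters like "<", "|", ">".
    stripped = regex.sub(r"[\s\p{L}\p{N}]", "", special_token)
    # Append a unique marker so different special tokens stay distinct.
    return stripped + f"Ĵ{i}"


# e.g. format_special_token_alternative("<|endoftext|>", 0) -> "<||>Ĵ0"
```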

@chenmoneygithub (Contributor, Author):

haha done.

@mattdangerw self-assigned this Mar 30, 2023
@chenmoneygithub (Contributor, Author)

/gcbrun

@mattdangerw (Member) left a comment

Thanks!

@chenmoneygithub (Contributor, Author)

/gcbrun

@chenmoneygithub merged commit df2e85e into keras-team:master on Apr 4, 2023