Handle OOV token in XLMRoBERTaTokenizer's token_to_id function #968

abheesht17 · 2023-04-06T13:40:01Z

No description provided.

mattdangerw · 2023-04-07T01:11:00Z

keras_nlp/models/xlm_roberta/xlm_roberta_tokenizer.py

        """Convert a string token to an integer id."""
+
+        # OOV token
+        if token not in self.get_vocabulary():


Does SentencePiece not handle the OOV token in this case? If we can just remap its output, that would be better.

This would require loading the full vocab list into a Python list and then iterating over it in its entirety. That's probably quite slow!
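A minimal sketch of the remapping idea suggested above, not the code merged in this PR. `FakeSentencePiece` is a hypothetical stand-in for the wrapped SentencePiece model, and the prefix list, the unknown-token id of 3, and the piece names are assumptions made for illustration. The only behavior taken from the diff is the `+ 1` offset applied to SentencePiece ids.

```python
# Hypothetical stand-in for a SentencePiece model: like the real thing, it is
# assumed to return the id of "<unk>" for any string it does not know.
class FakeSentencePiece:
    def __init__(self, pieces, unk_piece="<unk>"):
        self._ids = {piece: i for i, piece in enumerate(pieces)}
        self._unk_id = self._ids[unk_piece]

    def string_to_id(self, token):
        # Dictionary lookup; no scan over the whole vocabulary.
        return self._ids.get(token, self._unk_id)


# Assumed special-token prefix and unknown-token id (illustrative values).
VOCABULARY_PREFIX = ["<s>", "<pad>", "</s>", "<unk>"]
UNK_TOKEN_ID = VOCABULARY_PREFIX.index("<unk>")


def token_to_id(spm, token):
    """Convert a string token to an integer id, remapping OOV tokens."""
    if token in VOCABULARY_PREFIX:
        return VOCABULARY_PREFIX.index(token)

    spm_token_id = spm.string_to_id(token)

    # OOV token: the model answered with its own "<unk>" id, so remap it
    # instead of checking membership in the full vocabulary list.
    if spm_token_id == spm.string_to_id("<unk>"):
        return UNK_TOKEN_ID

    # In-vocabulary token: apply the offset shown in the diff.
    return spm_token_id + 1


spm = FakeSentencePiece(["<unk>", "<s>", "</s>", "▁the", "▁quick"])
print(token_to_id(spm, "▁the"))
print(token_to_id(spm, "not-in-vocab"))
```

The membership test here only touches the short prefix list, so the cost of an OOV check no longer depends on the vocabulary size.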

mattdangerw · 2023-04-07T17:11:45Z

keras_nlp/models/xlm_roberta/xlm_roberta_tokenizer.py

            return self._vocabulary_prefix.index(token)

-        return int(self._sentence_piece.string_to_id(token).numpy()) + 1
+        spm_token_id = int(self._sentence_piece.string_to_id(token).numpy())


Let's compare in tensor space and convert to int at the end?

mattdangerw · 2023-04-12T17:38:10Z

/gcbrun

mattdangerw

lgtm! thanks

Handle OOV token in XLMRoBERTaTokenizer's token_to_id function

4b739c6

abheesht17 requested a review from mattdangerw April 6, 2023 13:40

Add special token check in UT

a95b8ac

abheesht17 mentioned this pull request Apr 6, 2023

Add an XLMRobertaMaskedLM task model #950

Merged

mattdangerw reviewed Apr 7, 2023

abheesht17 added 2 commits April 7, 2023 13:20

Optimise

3eabd3c

Format code

8fb746a

mattdangerw reviewed Apr 7, 2023

Simplify

978cf09

mattdangerw approved these changes Apr 12, 2023

mattdangerw merged commit 28bfb04 into keras-team:master Apr 12, 2023
