abheesht17
Collaborator

No description provided.

@abheesht17 abheesht17 requested a review from mattdangerw April 6, 2023 13:40
"""Convert a string token to an integer id."""

# OOV token
if token not in self.get_vocabulary():
Member

Does SentencePiece not handle the OOV token in this case? If we can just remap its output, that would be better.

This check would require loading the full vocab into a Python list and then iterating over all of it. That's probably quite slow!
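
A rough sketch of the remapping idea, not the code in this PR: it assumes the underlying model returns its own unknown id for pieces it has never seen (which is what the question above is asking). `FakeSentencePiece`, `SketchTokenizer`, the toy vocabulary, and the `unk_token_id` value are all hypothetical stand-ins; only the prefix lookup and the `+ 1` offset mirror the diff.

```python
class FakeSentencePiece:
    """Hypothetical stand-in for a SentencePiece model (assumption)."""

    def __init__(self, pieces):
        self._ids = {piece: i for i, piece in enumerate(pieces)}
        self.unk_id = self._ids["<unk>"]

    def string_to_id(self, token):
        # Assumed behaviour: unseen pieces map to the model's <unk> id.
        return self._ids.get(token, self.unk_id)


class SketchTokenizer:
    """Illustrative wrapper; names and id layout are assumptions."""

    def __init__(self, spm, vocabulary_prefix, unk_token_id):
        self._sentence_piece = spm
        self._vocabulary_prefix = vocabulary_prefix
        self.unk_token_id = unk_token_id

    def token_to_id(self, token):
        # Special tokens live in a short prefix list, so this scan is cheap.
        if token in self._vocabulary_prefix:
            return self._vocabulary_prefix.index(token)
        spm_id = self._sentence_piece.string_to_id(token)
        # Remap the model's OOV output instead of scanning the full vocab.
        if spm_id == self._sentence_piece.unk_id:
            return self.unk_token_id
        # Shift by one, as in the diff under review.
        return spm_id + 1


spm = FakeSentencePiece(["<unk>", "<s>", "</s>", "the", "quick"])
tok = SketchTokenizer(spm, ["<s>", "<pad>", "</s>", "<unk>"], unk_token_id=3)
```

With this shape, an unknown token costs one model lookup plus one comparison, and the full vocabulary is never materialized.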

return self._vocabulary_prefix.index(token)

return int(self._sentence_piece.string_to_id(token).numpy()) + 1
spm_token_id = int(self._sentence_piece.string_to_id(token).numpy())
Member

Let's compare in tensor space and convert to int at the end?

Collaborator Author

Done!

@mattdangerw
Member

/gcbrun

Member

@mattdangerw mattdangerw left a comment

lgtm! thanks

@mattdangerw mattdangerw merged commit 28bfb04 into keras-team:master Apr 12, 2023