Conversation

mattdangerw (Member):

Fixes #27

@mattdangerw mattdangerw requested review from chenmoneygithub and fchollet and removed request for chenmoneygithub June 2, 2022 00:13
@mattdangerw mattdangerw changed the title Add SentencePiece tokenizer Add a SentencePiece tokenizer Jun 2, 2022

chenmoneygithub (Contributor) left a comment:

Thanks for the PR and sorry for the late reply!

Left some comments.


# Keras cannot serialize a bytestring, so we base64 encode the model
# byte array for saving.
self.model_bytes = base64.b64encode(model_bytes).decode("ascii")

Contributor:

Why do we apply decode("ascii") at encoding time, but not have it above?

Member Author:

I'm confused about what you mean by "above". We talked about this a bit earlier, I think; this is only for serialization.

self.assertAllEqual(model_output, ["the quick brown fox."])

def test_from_file(self):
model_file = os.path.join(self.get_temp_dir(), "model.txt")

Contributor:

where is this model.txt from?

Member Author:

I write the file two lines below inside the test.
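
A minimal sketch of the pattern being described, assuming the test trains a tiny SentencePiece model in memory and then writes the serialized proto out before loading from the path (the names, vocab size, and temp-dir helper here are illustrative, not the actual test):

import io
import os
import tempfile

import sentencepiece as spm

# Train a tiny SentencePiece model in memory (illustrative vocabulary size).
bytes_io = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(["the quick brown fox."]),
    model_writer=bytes_io,
    vocab_size=7,
)

# Write the serialized proto to a file; the tokenizer can then be
# constructed from that path.
model_file = os.path.join(tempfile.mkdtemp(), "model.txt")
with open(model_file, "wb") as f:
    f.write(bytes_io.getvalue())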

argument, which should be either an integer or string type.

Args:
model_file: A path to a SentencePiece serialized model file. One

Member Author:

merge into one argument, rename from model to something else?

Member Author:

Trying to think of names here. "serialized" might be useful, since this is either a binary file or a byte string...

serialized_settings, serialized_config, serialized_proto, serialized_spec, settings, config, proto, spec

Do any of these sound good @fchollet ? Here's the proto message if helpful https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto

Contributor:

want to know more about this - are you proposing to keep only one argument, and let the code decide how to load the model?

mattdangerw (Member Author), Jun 15, 2022:

Yeah, this was a note to self written during a meeting with Francois. His preference is to merge the arguments into one (which matches TextVectorization and WordPiece). The implementation might be a little more tricky, but the experience will be more consistent.

Then there's the separate question of what to name it. We discussed moving away from the word "model", which means something specific in Keras that we do not mean here. (That's the question above.)

Collaborator:

Right -- maybe just "state", since it's neither a vocabulary nor a model. The terminology should also be consistent across the trainer and the tokenizer. No strong opinion on the name.

Member Author:

Talked today. Let's make a single overloaded arg called proto. This is either a serialized proto bytestring or a proto filepath.
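
A minimal sketch of how that overloading could be handled, assuming the single argument is named proto as decided here; the helper and its error message are illustrative, not the merged implementation:

import tensorflow as tf

def _parse_proto_arg(proto):
    # Hypothetical helper: bytes are treated as an already-serialized
    # SentencePiece proto, strings as a path to a serialized proto file.
    if isinstance(proto, bytes):
        return proto
    if isinstance(proto, str):
        with tf.io.gfile.GFile(proto, "rb") as f:
            return f.read()
    raise ValueError(
        "`proto` must be a serialized proto bytestring or a filepath string. "
        f"Received: {proto}"
    )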


# Keras cannot serialize a bytestring, so we base64 encode the model
# byte array for saving.
self.model_bytes = base64.b64encode(model_bytes).decode("ascii")

Contributor:

Re the previous comment on ascii.

My confusion is that here it is base64.b64encode(model_bytes).decode("ascii"), but on line 115 it is model_bytes = base64.b64decode(model_bytes), which does not mention "ascii". Is this intentional?

Member Author:

Ah I see. I don't think it matters. base64.b64encode will return a byte array so we have to convert it to a string for serialization. But I believe base64.b64decode can handle a string as input.
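
A quick standard-library illustration of that asymmetry, independent of the PR code: b64encode returns bytes, so we decode to an ASCII string for the config, while b64decode accepts either a string or bytes.

import base64

model_bytes = b"\x0a\x04test"  # stand-in for the serialized proto bytes

# Encoding: b64encode returns bytes, so decode to str for the Keras config.
config_value = base64.b64encode(model_bytes).decode("ascii")

# Decoding: b64decode accepts the str directly; no explicit "ascii" step needed.
assert base64.b64decode(config_value) == model_bytes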


if model_file is None and model_bytes is None:
raise ValueError(
"One of `model_file` or `model_bytes` must be set. "

Collaborator:

A single required argument will be easier.

# Ideally the model would be saved as a file asset in
# the saved model. We have no good way to support this
# currently, so we save the model string in the config.
"model_bytes": self.model_bytes,

Collaborator:

The keyword should be updated here as well. Seems fine to save the state in the config since it's a required constructor argument.
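
A hedged sketch of what the config round trip could look like after the rename to proto; the class and method bodies are illustrative, not the merged code:

import base64

class SentencePieceTokenizerSketch:
    # Illustrative only: stores the proto as a base64 string in the config
    # and restores the raw bytes in from_config.

    def __init__(self, proto):
        self.proto = proto  # serialized SentencePiece proto bytes

    def get_config(self):
        return {
            # Keras cannot serialize raw bytes, so save a base64 string.
            "proto": base64.b64encode(self.proto).decode("ascii"),
        }

    @classmethod
    def from_config(cls, config):
        config = dict(config)
        # base64.b64decode accepts the base64 string directly.
        config["proto"] = base64.b64decode(config["proto"])
        return cls(**config)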

@mattdangerw mattdangerw merged commit 84bed77 into keras-team:master Jun 28, 2022
Successfully merging this pull request may close these issues.

Add a SentencePiece tokenizer layer