Conversation

mattdangerw (Member):

Fixes #27

@mattdangerw mattdangerw requested review from chenmoneygithub and fchollet and removed request for chenmoneygithub June 2, 2022 00:13
@mattdangerw mattdangerw changed the title Add SentencePiece tokenizer Add a SentencePiece tokenizer Jun 2, 2022

chenmoneygithub (Contributor) left a comment:

Thanks for the PR and sorry for the late reply!

Left some comments.


# Keras cannot serialize a bytestring, so we base64 encode the model
# byte array for saving.
self.model_bytes = base64.b64encode(model_bytes).decode("ascii")

Contributor:

Why do we apply decode("ascii") at encoding time, but not have it above?

Member Author:

I'm confused about what you mean by "above". We talked about this a bit earlier, I think; this is only for serialization.

self.assertAllEqual(model_output, ["the quick brown fox."])

def test_from_file(self):
model_file = os.path.join(self.get_temp_dir(), "model.txt")

Contributor:

where is this model.txt from?

Member Author:

I write the file two lines below inside the test.
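
A minimal sketch of the pattern being described, assuming the test trains a tiny SentencePiece model in memory and then writes the serialized proto out before loading from the path (the names, vocab size, and temp-dir helper here are illustrative, not the actual test):

import io
import os
import tempfile

import sentencepiece as spm

# Train a tiny SentencePiece model in memory (illustrative vocabulary size).
bytes_io = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(["the quick brown fox."]),
    model_writer=bytes_io,
    vocab_size=7,
)

# Write the serialized proto to a file; the tokenizer can then be
# constructed from that path.
model_file = os.path.join(tempfile.mkdtemp(), "model.txt")
with open(model_file, "wb") as f:
    f.write(bytes_io.getvalue())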

argument, which should be either an integer or string type.

Args:
model_file: A path to a SentencePiece serialized model file. One

Member Author:

merge into one argument, rename from model to something else?

Member Author:

Trying to think of names here. "serialized" might be useful, since this is either a binary file or a byte string...

serialized_settings, serialized_config, serialized_proto, serialized_spec, settings, config, proto, spec

Do any of these sound good @fchollet ? Here's the proto message if helpful https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto

Contributor:

want to know more about this - are you proposing to keep only one argument, and let the code decide how to load the model?

mattdangerw (Member Author), Jun 15, 2022:

Yeah, this was a note to self written during a meeting with Francois. His preference is to merge the arguments into one (which matches TextVectorization and WordPiece). The implementation might be a little more tricky, but the experience will be more consistent.

Then there's the separate question of what to name it. We discussed moving away from the word "model", which means something specific in Keras that we do not mean here. (That's the question above.)

Collaborator:

Right -- maybe just "state", since it's neither a vocabulary nor a model. The terminology should also be consistent across the trainer and the tokenizer. No strong opinion on the name.

Member Author:

Talked today. Let's make a single overloaded arg called proto. This is either a serialized proto bytestring or a proto filepath.
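
A minimal sketch of how that overloading could be handled, assuming the single argument is named proto as decided here; the helper and its error message are illustrative, not the merged implementation:

import tensorflow as tf

def _parse_proto_arg(proto):
    # Hypothetical helper: bytes are treated as an already-serialized
    # SentencePiece proto, strings as a path to a serialized proto file.
    if isinstance(proto, bytes):
        return proto
    if isinstance(proto, str):
        with tf.io.gfile.GFile(proto, "rb") as f:
            return f.read()
    raise ValueError(
        "`proto` must be a serialized proto bytestring or a filepath string. "
        f"Received: {proto}"
    )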


# Keras cannot serialize a bytestring, so we base64 encode the model
# byte array for saving.
self.model_bytes = base64.b64encode(model_bytes).decode("ascii")

Contributor:

Re the previous comment on ascii.

My confusion is that here it is base64.b64encode(model_bytes).decode("ascii"), but on line 115 it is model_bytes = base64.b64decode(model_bytes), which does not mention "ascii". Is this intentional?

Member Author:

Ah I see. I don't think it matters. base64.b64encode will return a byte array so we have to convert it to a string for serialization. But I believe base64.b64decode can handle a string as input.
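
A quick standard-library illustration of that asymmetry, independent of the PR code: b64encode returns bytes, so we decode to an ASCII string for the config, while b64decode accepts either a string or bytes.

import base64

model_bytes = b"\x0a\x04test"  # stand-in for the serialized proto bytes

# Encoding: b64encode returns bytes, so decode to str for the Keras config.
config_value = base64.b64encode(model_bytes).decode("ascii")

# Decoding: b64decode accepts the str directly; no explicit "ascii" step needed.
assert base64.b64decode(config_value) == model_bytes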


if model_file is None and model_bytes is None:
raise ValueError(
"One of `model_file` or `model_bytes` must be set. "

Collaborator:

A single required argument will be easier.

# Ideally the model would be saved as a file asset in
# the saved model. We have no good way to support this
# currently, so we save the model string in the config.
"model_bytes": self.model_bytes,

Collaborator:

The keyword should be updated here as well. Seems fine to save the state in the config since it's a required constructor argument.
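
A hedged sketch of what the config round trip could look like after the rename to proto; the class and method bodies are illustrative, not the merged code:

import base64

class SentencePieceTokenizerSketch:
    # Illustrative only: stores the proto as a base64 string in the config
    # and restores the raw bytes in from_config.

    def __init__(self, proto):
        self.proto = proto  # serialized SentencePiece proto bytes

    def get_config(self):
        return {
            # Keras cannot serialize raw bytes, so save a base64 string.
            "proto": base64.b64encode(self.proto).decode("ascii"),
        }

    @classmethod
    def from_config(cls, config):
        config = dict(config)
        # base64.b64decode accepts the base64 string directly.
        config["proto"] = base64.b64decode(config["proto"])
        return cls(**config)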

@mattdangerw mattdangerw merged commit 84bed77 into keras-team:master Jun 28, 2022
Successfully merging this pull request may close these issues.

Add a SentencePiece tokenizer layer