Add a SentencePiece tokenizer #218
Conversation
Thanks for the PR and sorry for the late reply!
Left some comments.
```python
# Keras cannot serialize a bytestring, so we base64 encode the model
# byte array for saving.
self.model_bytes = base64.b64encode(model_bytes).decode("ascii")
```
Why do we apply `decode("ascii")` at encoding time, but not have it above?
I'm confused about what you mean by "above". We talked about this a bit earlier, I think; this is only for serialization.
```python
        self.assertAllEqual(model_output, ["the quick brown fox."])

    def test_from_file(self):
        model_file = os.path.join(self.get_temp_dir(), "model.txt")
```
Where is this `model.txt` from?
I write the file two lines below inside the test.
```python
        argument, which should be either an integer or string type.

    Args:
        model_file: A path to a SentencePiece serialized model file. One
```
Merge into one argument, and rename from `model` to something else?
Trying to think of names here. "serialized" might be useful, because this is either a binary file or a byte string...

`serialized_settings`, `serialized_config`, `serialized_proto`, `serialized_spec`, `settings`, `config`, `proto`, `spec`

Do any of these sound good @fchollet? Here's the proto message if helpful: https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto
Want to know more about this: are you proposing to keep only one argument, and let the code decide how to load the model?
Yeah, this was a note to self written during a meeting with Francois. His preference is to merge the arguments into one (which matches TextVectorization and WordPiece). The implementation might be a little more tricky, but the experience will be more consistent.

Then there's the separate question of what to name it. We discussed moving away from the word "model", which means something specific in Keras that we do not mean here. (That's the question above.)
Right -- maybe just `state`, since it's neither a vocabulary nor a model. Also, the terminology should be consistent across the trainer and the tokenizer. No strong opinion on the name.
Talked today. Let's make a single overloaded arg called `proto`. This is either a serialized proto bytestring or a proto filepath.
Re the previous comment on ascii: my confusion is that here it is `base64.b64encode(model_bytes).decode("ascii")`, but on line 115 it is `model_bytes = base64.b64decode(model_bytes)`, which does not mention "ascii". Is this intentional?
Ah, I see. I don't think it matters. `base64.b64encode` will return a byte array, so we have to convert it to a string for serialization. But I believe `base64.b64decode` can handle a string as input.
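This can be verified with a quick round trip using only the stdlib `base64` module; `b64encode` returns `bytes`, while `b64decode` accepts either `str` or `bytes`:

```python
import base64

model_bytes = b"\x00\x01\x02 raw sentencepiece proto bytes"

# b64encode returns bytes; decode("ascii") turns that into a
# JSON-serializable str for the Keras config.
encoded = base64.b64encode(model_bytes).decode("ascii")
assert isinstance(encoded, str)

# b64decode accepts the str directly and returns the original bytes,
# so no explicit encode("ascii") is needed on the way back.
assert base64.b64decode(encoded) == model_bytes
```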
```python
        if model_file is None and model_bytes is None:
            raise ValueError(
                "One of `model_file` or `model_bytes` must be set. "
```
A single required argument will be easier.
```python
            # Ideally the model would be saved as a file asset in
            # the saved model. We have no good way to support this
            # currently, so we save the model string in the config.
            "model_bytes": self.model_bytes,
```
The keyword should be updated here as well. Seems fine to save the state in the config since it's a required constructor argument.
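For illustration, a toy round trip of that config path; the class name and config key are hypothetical sketches of the pattern under discussion, not the PR's actual implementation:

```python
import base64


class SentencePieceTokenizerSketch:
    """Illustrative only: stores the serialized proto in the config."""

    def __init__(self, proto_bytes):
        # Keras cannot serialize a bytestring, so base64 encode the
        # proto bytes and keep an ascii str for saving.
        self.proto_bytes = base64.b64encode(proto_bytes).decode("ascii")

    def get_config(self):
        # Ideally this would be a file asset in the saved model; until
        # then, the encoded proto string lives in the config.
        return {"proto": self.proto_bytes}

    @classmethod
    def from_config(cls, config):
        # b64decode accepts the str stored in the config directly.
        return cls(base64.b64decode(config["proto"]))
```

Because the encoded string is a required constructor argument, a config round trip through `get_config`/`from_config` reconstructs an equivalent tokenizer.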
Fixes #27