Open-Ended Text Generation Guide KerasNLP #956
Conversation
Thanks for the PR! Some preliminary comments for now.
@mattdangerw please take a look at the example as well.
- An LSTM model to generate text character-by-character
- Decoding using the `keras_nlp.utils.greedy_search` utility

This tutorial will be pretty useful and will be a good starting point for learning about KerasNLP and how to
Avoid subjective statements such as "This tutorial will be pretty useful".
x_unicode = keras.Input(shape=(None,))
e_unicode = keras.layers.Embedding(input_dim=591, output_dim=128)(x_unicode)
y_unicode = keras.layers.LSTM(128)(e_unicode)
Rather than an LSTM, prefer using the `TransformerEncoder`.
Thanks! Left some initial comments
"""
EXAMPLE 1
PROMPT: que ya su dulce esposo no vivía, rompió los aires con suspiros, hirió los cielos con quejas, maltra
Overall I think this example would be most compelling with a CJK dataset of some sort. That would really hammer home the tradeoff of a large embedding vs. a much longer input sequence. But I don't know of a good data source we could swap in here.
Just leaving this comment in case someone happens to know a dataset :)
to a UTF-8 format to be interpreted well
"""

def decode_sequences(input_sentences, model, generation_length=50):
The next couple of code blocks could be simplified a bit. Maybe let's use a fixed, shorter prompt, "Rocinante", or something like that, and avoid all the to_tensor shape manipulation. You can probably leave max_length as a constant directly in greedy_search rather than passing it around. And if you are only starting with a scalar prompt, you should be able to avoid all the [0] indexing you have.
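A minimal sketch of the simplification being suggested, in plain Python: decode greedily from a single scalar prompt, with no batch dimension, no `to_tensor` reshaping, and `max_length` inlined. The function `toy_next_token` is a hypothetical stand-in for the trained model's next-token prediction, not part of the PR:

```python
def toy_next_token(token_ids):
    # Hypothetical stand-in for the model: deterministically pick the
    # next byte id so the sketch is self-contained and runnable.
    return (token_ids[-1] + 1) % 256

def greedy_decode(prompt, max_length=12):
    # Byte-tokenize the scalar prompt directly; no ragged/batch
    # dimension and no [0] indexing needed.
    token_ids = list(prompt.encode("utf-8"))
    while len(token_ids) < max_length:
        # Greedy search: always append the single most likely token.
        token_ids.append(toy_next_token(token_ids))
    return bytes(token_ids).decode("utf-8", errors="replace")

print(greedy_decode("Rocinante", max_length=12))  # → Rocinantefgh
```

With a scalar prompt like this, the whole decode loop operates on one flat list of ids, which is the shape of simplification the comment is after.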
print(byte_tokenizer.detokenize(byte_tokenized_text[:50]))

"""
We can see that post-tokenization the datasets now differ, which is to be expected, as byte tokenization and unicode
Maybe just phrase this as:

Let's try tokenizing the input string "señora mía".

print(codepoint_tokenizer("señora mía"))
print(byte_tokenizer("señora mía"))

This captures the tradeoff between the two tokenizers. The byte tokenizer will handle any text with only 256 output ids, at the cost of encoding to a longer sequence. The unicode tokenizer will produce a shorter sequence, at the cost of a larger id space (and larger embeddings).
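The same tradeoff can be demonstrated with plain Python, no tokenizer classes required (an illustrative sketch, not the PR's code):

```python
text = "señora mía"

# Byte "tokenization": every id fits in 0-255, but the multi-byte UTF-8
# characters ñ and í each expand to two ids.
byte_ids = list(text.encode("utf-8"))

# Codepoint "tokenization": one id per character, but the id space spans
# the full Unicode range (up to 0x10FFFF), so embeddings must be larger.
codepoint_ids = [ord(c) for c in text]

print(len(byte_ids))         # → 12 (ñ and í take two bytes each)
print(len(codepoint_ids))    # → 10 (one id per character)
print(max(byte_ids) <= 255)  # → True (byte vocabulary never exceeds 256)
```

Ten characters become twelve byte ids: the byte tokenizer trades sequence length for a tiny, fixed vocabulary.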
OUTPUT: que ya su dulce esposo no vivía, rompió los aires con suspiros, hirió los cielos con quejas, maltrae. Ɏ y el como son prés como sancho de la carón

Model: model_byte
OUTPUT: que ya su dulce esposo no vivía, rompió los aires con suspiros, hirió los cielos con quejas, maltra de su don quijoteâ, que su perreso de su cab
We need to keep looking at where these erroneous characters (â, etc.) are coming from. Maybe we need a longer dataset? Different hyperparameters? Maybe trying the transformer will help?

Especially with greedy search, we should at least be able to train to the point where we output a valid set of characters that are actually in the source text.
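One likely source of such characters: a byte-level model is free to emit byte sequences that are not valid UTF-8, and decoding those produces mojibake or replacement characters. A quick stdlib illustration of both failure modes (the bytes here are chosen for illustration, not taken from the model's actual output):

```python
# "â" is U+00E2; its UTF-8 encoding is the two-byte sequence 0xC3 0xA2.
raw = "â".encode("utf-8")
print(list(raw))  # → [195, 162]

# Failure mode 1: the bytes are mis-decoded as Latin-1, so each byte
# becomes its own character -- classic mojibake.
print(raw.decode("latin-1"))  # → Ã¢

# Failure mode 2: the model emits a lead byte without its continuation
# byte; that is invalid UTF-8, and errors="replace" yields U+FFFD.
print(raw[:1].decode("utf-8", errors="replace"))  # → �
```

So a byte model that has not fully learned the UTF-8 continuation-byte structure will produce exactly this kind of garbage, which a codepoint model cannot, by construction.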
@aflah02 Thanks for the PR. Are you planning to implement the suggestions from the feedback? Let us know if you're still working on this. Otherwise we'll close the request. Thanks!
@pcoet Sorry, I'm blocked due to other commitments and would be unable to finish this PR in the near future. Closing it for now, then. In case someone wants to pick this up:
This guide compares byte and unicode tokenizers for text generation.
@mattdangerw Raised a PR based on our discussion, it's open for review now!
Referring to the README, I will add the generated files after approval.
Reference: keras-team/keras-nlp#191