
Open Ended Text Generation Guide KerasNLP #956

Closed
aflah02 wants to merge 3 commits

Conversation


@aflah02 aflah02 commented Jul 1, 2022

This guide compares byte and unicode tokenizers for text generation.
@mattdangerw I've raised a PR based on our discussion; it's open for review now!
As mentioned in the README, I will add the generated files after approval.
Reference: keras-team/keras-nlp#191

@fchollet (Member) left a comment

Thanks for the PR! Some preliminary comments for now.

@mattdangerw please take a look at the example as well.

- An LSTM model to generate text character-by-character
- Decoding using `keras_nlp.utils.greedy_search` utility

This tutorial will be pretty useful and will be a good starting point for learning about KerasNLP and how to

Avoid subjective statements such as "This tutorial will be pretty useful".


x_unicode = keras.Input(shape=(None,))
e_unicode = keras.layers.Embedding(input_dim=591, output_dim=128)(x_unicode)
y_unicode = keras.layers.LSTM(128)(e_unicode)

Rather than an LSTM, prefer using the TransformerEncoder.
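
For reference, a minimal sketch of what that swap could look like, using keras_nlp.layers.TokenAndPositionEmbedding and keras_nlp.layers.TransformerEncoder. The fixed SEQ_LEN, the pooling layer, and the hyperparameters are assumptions; only the 591-entry vocabulary and 128-dim embedding are carried over from the snippet above.

import keras_nlp
from tensorflow import keras

SEQ_LEN = 100  # hypothetical fixed sequence length, needed for the position embedding

x_unicode = keras.Input(shape=(SEQ_LEN,), dtype="int32")
# Token + position embeddings, since the encoder has no recurrence to track order.
e_unicode = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=591, sequence_length=SEQ_LEN, embedding_dim=128
)(x_unicode)
h_unicode = keras_nlp.layers.TransformerEncoder(intermediate_dim=256, num_heads=4)(e_unicode)
# Pool over the sequence to mirror the single-vector output of LSTM(128) above.
y_unicode = keras.layers.GlobalAveragePooling1D()(h_unicode)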

@mattdangerw (Member) left a comment

Thanks! Left some initial comments


"""
EXAMPLE 1
PROMPT: que ya su dulce esposo no vivía, rompió los aires con suspiros, hirió los cielos con quejas, maltra

Overall I think this example would be most compelling with a CJK dataset of some sort. That would really hammer home the tradeoff of a large embedding vs a much longer input sequence. But I don't know of a good data source we could swap in here.

Just leaving this comment in case someone happens to know a dataset :)

to a utf-8 format to be interpreted well
"""

def decode_sequences(input_sentences, model, generation_length=50):

The next couple of code blocks could be simplified a bit. Maybe let's use a fixed shorter prompt, "Rocinante", or something like that, and avoid all the to_tensor shape manipulation. You can probably leave max_length as a constant directly in greedy_search rather than passing it around. And if you are only starting with a scalar prompt, you should be able to avoid all the [0] indexing you have.
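
A rough sketch of that simplification, assuming the keras_nlp.utils.greedy_search(token_probability_fn, prompt, max_length) signature available in the KerasNLP version this PR targets, and the model_byte name from the guide; treat it as illustrative rather than tested.

import keras_nlp

MAX_LENGTH = 150  # kept as a constant here instead of being passed around

byte_tokenizer = keras_nlp.tokenizers.ByteTokenizer()

def next_token_probabilities(inputs):
    # The trained model returns a probability distribution over the next token.
    return model_byte(inputs)

# A single fixed prompt (1D tensor of byte ids), so no [0] indexing is needed.
prompt = byte_tokenizer("Rocinante")
generated = keras_nlp.utils.greedy_search(
    next_token_probabilities, prompt, max_length=MAX_LENGTH
)
print(byte_tokenizer.detokenize(generated))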

print(byte_tokenizer.detokenize(byte_tokenized_text[:50]))

"""
We can see that, post tokenization, the datasets now differ, which is to be expected, as byte tokenization and unicode

Maybe just phrase this as:

Let's try tokenizing the input string "señora mía".

print(codepoint_tokenizer("señora mía"))
print(byte_tokenizer("señora mía"))

This captures the tradeoff between the two tokenizers. The byte tokenizer will handle any text with only 256 output ids, at the cost of encoding to a longer sequence. The unicode tokenizer will produce shorter sequences, at the cost of a larger id space (and larger embeddings).
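
For completeness, a small runnable sketch of that comparison; UnicodeCodepointTokenizer and the vocabulary_size of 591 are assumptions (the codepoint tokenizer may be named differently in the KerasNLP version this PR targets).

import keras_nlp

byte_tokenizer = keras_nlp.tokenizers.ByteTokenizer()
codepoint_tokenizer = keras_nlp.tokenizers.UnicodeCodepointTokenizer(vocabulary_size=591)

text = "señora mía"
# One id per character: accented letters cost a single (larger) id.
print(codepoint_tokenizer(text))
# One id per UTF-8 byte: "ñ" and "í" each expand to two ids, so the sequence is longer.
print(byte_tokenizer(text))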

OUTPUT: que ya su dulce esposo no vivía, rompió los aires con suspiros, hirió los cielos con quejas, maltrae. Ɏ y el como son prés como sancho de la carón

Model: model_byte
OUTPUT: que ya su dulce esposo no vivía, rompió los aires con suspiros, hirió los cielos con quejas, maltra de su don quijote—, que su perreso de su cab

We need to keep looking at where these erroneous characters (â�) are coming from. Maybe we need a longer dataset? Different hyperparameters? Maybe trying the transformer will help?

Especially with greedy search, we should at least be able to train to the point where we output a valid set of characters that are actually in the source text.

@pcoet (Collaborator) commented Aug 16, 2023

@aflah02 Thanks for the PR. Are you planning to implement the suggestions from the feedback? Let us know if you're still working on this. Otherwise we'll close the request. Thanks!

@aflah02 (Author) commented Aug 17, 2023

@pcoet Sorry, I'm blocked due to other commitments and will be unable to finish this PR in the near future. Closing it for now, then.

In case someone wants to pick this up:
I was unable to make progress because the models often kept producing gibberish, and I could not solve that issue even with different configs.

@aflah02 aflah02 closed this Aug 17, 2023