
Open Ended Text Generation Guide KerasNLP #956

Closed
aflah02 wants to merge 3 commits

Conversation


@aflah02 aflah02 commented Jul 1, 2022

This guide compares byte and unicode tokenizers for text generation.
@mattdangerw I've raised a PR based on our discussion; it's open for review now!
As mentioned in the README, I will add the generated files after approval.
Reference: keras-team/keras-nlp#191

@fchollet (Member) left a comment

Thanks for the PR! Some preliminary comments for now.

@mattdangerw please take a look at the example as well.

- An LSTM model to generate text character-by-character
- Decoding using `keras_nlp.utils.greedy_search` utility

This tutorial will be pretty useful and will be a good starting point for learning about KerasNLP and how to

Avoid subjective statements such as "This tutorial will be pretty useful".


x_unicode = keras.Input(shape=(None,))
e_unicode = keras.layers.Embedding(input_dim=591, output_dim=128)(x_unicode)
y_unicode = keras.layers.LSTM(128)(e_unicode)

Rather than an LSTM, prefer using the TransformerEncoder.
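
For reference, a minimal sketch of what that swap could look like, using keras_nlp.layers.TokenAndPositionEmbedding and keras_nlp.layers.TransformerEncoder. The fixed SEQ_LEN, the pooling layer, and the hyperparameters are assumptions; only the 591-entry vocabulary and 128-dim embedding are carried over from the snippet above.

import keras_nlp
from tensorflow import keras

SEQ_LEN = 100  # hypothetical fixed sequence length, needed for the position embedding

x_unicode = keras.Input(shape=(SEQ_LEN,), dtype="int32")
# Token + position embeddings, since the encoder has no recurrence to track order.
e_unicode = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=591, sequence_length=SEQ_LEN, embedding_dim=128
)(x_unicode)
h_unicode = keras_nlp.layers.TransformerEncoder(intermediate_dim=256, num_heads=4)(e_unicode)
# Pool over the sequence to mirror the single-vector output of LSTM(128) above.
y_unicode = keras.layers.GlobalAveragePooling1D()(h_unicode)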

@mattdangerw (Member) left a comment

Thanks! Left some initial comments


"""
EXAMPLE 1
PROMPT: que ya su dulce esposo no vivía, rompió los aires con suspiros, hirió los cielos con quejas, maltra

Overall I think this example would be most compelling with a CJK dataset of some sort. That would really hammer home the tradeoff of a large embedding vs a much longer input sequence. But I don't know of a good data source we could swap in here.

Just leaving this comment in case someone happens to know a dataset :)

to a utf-8 format to be interpreted well
"""

def decode_sequences(input_sentences, model, generation_length=50):

The next couple of code blocks could be simplified a bit. Maybe let's use a fixed shorter prompt, "Rocinante", or something like that, and avoid all the to_tensor shape manipulation. You can probably leave max_length as a constant directly in greedy_search rather than passing it around. And if you are only starting with a scalar prompt, you should be able to avoid all the [0] indexing you have.
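
A rough sketch of that simplification, assuming the keras_nlp.utils.greedy_search(token_probability_fn, prompt, max_length) signature available in the KerasNLP version this PR targets, and the model_byte name from the guide; treat it as illustrative rather than tested.

import keras_nlp

MAX_LENGTH = 150  # kept as a constant here instead of being passed around

byte_tokenizer = keras_nlp.tokenizers.ByteTokenizer()

def next_token_probabilities(inputs):
    # The trained model returns a probability distribution over the next token.
    return model_byte(inputs)

# A single fixed prompt (1D tensor of byte ids), so no [0] indexing is needed.
prompt = byte_tokenizer("Rocinante")
generated = keras_nlp.utils.greedy_search(
    next_token_probabilities, prompt, max_length=MAX_LENGTH
)
print(byte_tokenizer.detokenize(generated))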

print(byte_tokenizer.detokenize(byte_tokenized_text[:50]))

"""
We can see that, post tokenization, the datasets now differ, which is to be expected, as byte tokenization and unicode

Maybe just phrase this as:

Let's try tokenizing the input string "señora mía".

print(codepoint_tokenizer("señora mía"))
print(byte_tokenizer("señora mía"))

This captures the tradeoff between the two tokenizers. The byte tokenizer will handle any text with only 256 output ids, at the cost of encoding to a longer sequence. The unicode tokenizer will produce shorter sequences, at the cost of a larger id space (and larger embeddings).
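
For completeness, a small runnable sketch of that comparison; UnicodeCodepointTokenizer and the vocabulary_size of 591 are assumptions (the codepoint tokenizer may be named differently in the KerasNLP version this PR targets).

import keras_nlp

byte_tokenizer = keras_nlp.tokenizers.ByteTokenizer()
codepoint_tokenizer = keras_nlp.tokenizers.UnicodeCodepointTokenizer(vocabulary_size=591)

text = "señora mía"
# One id per character: accented letters cost a single (larger) id.
print(codepoint_tokenizer(text))
# One id per UTF-8 byte: "ñ" and "í" each expand to two ids, so the sequence is longer.
print(byte_tokenizer(text))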

OUTPUT: que ya su dulce esposo no vivía, rompió los aires con suspiros, hirió los cielos con quejas, maltrae. Ɏ y el como son prés como sancho de la carón

Model: model_byte
OUTPUT: que ya su dulce esposo no vivía, rompió los aires con suspiros, hirió los cielos con quejas, maltra de su don quijote—, que su perreso de su cab

We need to keep looking at where these erroneous characters (â�) are coming from. Maybe we need a longer dataset? Different hyperparameters? Maybe trying the transformer will help?

Especially with greedy search, we should at least be able to train to the point where we output a valid set of characters that are actually in the source text.

@pcoet (Collaborator) commented Aug 16, 2023

@aflah02 Thanks for the PR. Are you planning to implement the suggestions from the feedback? Let us know if you're still working on this. Otherwise we'll close the request. Thanks!

@aflah02 (Author) commented Aug 17, 2023

@pcoet Sorry, I'm blocked due to other commitments and will be unable to finish this PR in the near future. Closing it for now, then.

In case someone wants to pick this up:
I was unable to make progress because the models often kept producing gibberish, and I could not solve that issue even with different configs.

@aflah02 aflah02 closed this Aug 17, 2023