Update generate() to work like fit() and predict() #932
Conversation
@fchollet @chenmoneygithub opening up a draft of a refactor for `generate()`. We will probably need some tweaks here, but overall I think we need something like this. Long term, we should probably factor a lot of the code from this into a common generative task class.
Excellent!
Thanks Matt! The proposed workflow is nice, and the code is cleaner with `self.packer`! Left some comments on implementation details.
output = generate_function(prompt, input_mask, min_length)

def preprocess(x, y=None, sample_weight=None):
    if self.preprocessor is not None:
        return self.preprocessor(x, sequence_length=max_length)
If we use `preprocessor` here, I would vote for setting `add_end_token=False` by default; adding the end token is only really useful for chatbot models.
Currently both `add_start_token` and `add_end_token` default to `False`, right?
I actually ended up flipping the defaults on this and removing the end token during generation only. Will add a comment below.
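A minimal sketch of that "remove the end token during generation only" behavior, assuming the tokenizer exposes an `end_token_id` attribute (the names here are illustrative, not the PR's exact code):

    import tensorflow as tf

    def strip_end_token(token_ids, padding_mask, end_token_id):
        # Mask out any end token in the prompt so generation continues past it,
        # while fit() preprocessing keeps the end token for next-token targets.
        is_end = tf.equal(token_ids, end_token_id)
        padding_mask = tf.logical_and(
            tf.cast(padding_mask, "bool"), tf.logical_not(is_end)
        )
        return token_ids, padding_mask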
x = super().call(x)
# Tokenize with one extra token to account for the truncation below.
sequence_length = (sequence_length or self.sequence_length) + 1
This is a bit odd to me... `sequence_length` has a higher priority than `self.sequence_length` because we need to respect `max_length` in the `generate()` method. However, for users without this context it could be weird. Can we override the `sequence_length` in the `generate()` method instead?
Like changing the name from `max_length` to `sequence_length`? No strong preference there.
Let's avoid mutating config state on a layer, but I'm not sure what the best approach is. This just seemed "least bad" to me.
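A minimal sketch of the "least bad" call-time override, versus the mutation we want to avoid (illustrative names, not the PR's exact code):

    class PackerSketch:
        """Call-time override of a configured default, without touching config."""

        def __init__(self, sequence_length=128):
            self.sequence_length = sequence_length

        def call(self, x, sequence_length=None):
            # generate() can pass max_length here; the configured default is
            # left untouched for later fit()/predict() calls.
            sequence_length = sequence_length or self.sequence_length
            return x[:sequence_length]

    # The avoided alternative mutates layer config and leaks into later calls:
    #     packer.sequence_length = max_length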
(Force-pushed from db7d1c0 to d5d1b6d.)
OK! I have addressed comments and tried to get the high-level workflows looking the way we want. The major awkwardness we are facing is that preprocessing for generation and fine-tuning look quite different. This PR makes all preprocessing run through the attached preprocessor. Overall I think this is worth it, but the fact that we are shoving two preprocessing flows into a single task & preprocessor is a little awkward. Options I can think of...

Currently I am going with 1., mainly because it is the least disruptive to what we currently have set up.
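For concreteness, the two flows being funneled through one preprocessor look roughly like this (a conceptual sketch with assumed call signatures, not the PR's code):

    def preprocess_for_fit(preprocessor, texts):
        # Fine-tuning: tokenize, then split into inputs and shifted next-token labels.
        x = preprocessor(texts)  # assumed to return {"token_ids", "padding_mask"}
        features = {
            "token_ids": x["token_ids"][..., :-1],
            "padding_mask": x["padding_mask"][..., :-1],
        }
        labels = x["token_ids"][..., 1:]
        sample_weight = x["padding_mask"][..., 1:]
        return features, labels, sample_weight

    def preprocess_for_generate(preprocessor, texts, max_length):
        # Generation: only model inputs, padded to max_length, no end token appended.
        return preprocessor(texts, sequence_length=max_length, add_end_token=False)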
(Force-pushed from 9ca7d8f to b09ae70.)
(Force-pushed from b09ae70 to 5c05dd7.)
Thanks! The functionality looks good, left some initial comments.
"But I watch youtube while coding!", | ||
] | ||
ds = tf.data.Dataset.from_tensor_slices(features).batch(2) | ||
# Prompt with 50256, the `"<|endoftext|>"` token id. |
The purpose of this prompt could be a bit unclear to readers. My guess is we are showing that generate will still do its work if the prompt contains `"<|endoftext|>"`; should we reflect that in the comment?
Oh, actually I think the example I was showing is calling `generate()` without preprocessing, so this has nothing to do with our tokenizer. I updated the example a bit for clarity and removed the endoftext part.
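For context, the no-preprocessor path being discussed would look roughly like this (the preset name, the `preprocessor=None` argument, and the token ids are assumptions for illustration, not the PR's exact example):

    import tensorflow as tf
    import keras_nlp

    causal_lm = keras_nlp.models.GPT2CausalLM.from_preset(
        "gpt2_base_en", preprocessor=None
    )
    # Already-tokenized, already-padded inputs; 50256 is the `"<|endoftext|>"` id.
    prompt = {
        "token_ids": tf.constant([[50256, 464, 2068, 7586, 21831] + [0] * 59]),
        "padding_mask": tf.constant([[True] * 5 + [False] * 59]),
    }
    # With no preprocessor attached, generate() consumes these tensors directly.
    output = causal_lm.generate(prompt)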
(Force-pushed from d6947f4 to 3059061.)
Comments addressed!
(Force-pushed from 3059061 to e69b262.)
Thanks! This seems very cool but I've fallen a bit behind on some context. Most of my questions are asking about simplifications.
Also: how much of this code is specific to GPT-2? Would it make sense to put this in a base class, or does it make sense to rewrite it from scratch for OPT?
    `tf.Tensor` or `tf.data.Dataset` with keys `"token_ids"` and
    `"padding_mask"`.
max_length: int. The max length of the generated sequence.
batch_size: int. Only pass if `inputs` is a `tf.Tensor`. If set,
What if the input tensors themselves are batched?
The user-friendly thing to do is just to throw in that case. `fit()` and `predict()` do not handle batched tensor/numpy input. If you attempt `bert_classifier.predict(batched_tensors)` you would get an error IIUC, as you would with any vanilla Keras model for that matter. We could definitely add an error message.
I see, so one batch at most?
Well, I guess it depends on what you mean by "batched". Do you mean a tensor with shape `(num_batches, batch_size, feature shapes)`, or do you mean a Python list of tensor batches? The former would not work in `fit()` or `predict()` etc. The latter might for vanilla Keras? I would need to check.

The main modes of input to the high-level Keras APIs are either a batched dataset, or a single numpy/tensor with shape `(num_samples, feature shapes)` and a `batch_size` provided separately IIUC.
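For concreteness, those two input modes with a plain Keras model:

    import numpy as np
    import tensorflow as tf

    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    x = np.random.random((100, 4)).astype("float32")

    # 1) A single un-batched array of shape (num_samples, features),
    #    with batch_size passed separately.
    model.predict(x, batch_size=8)

    # 2) An already-batched tf.data.Dataset; no batch_size argument.
    ds = tf.data.Dataset.from_tensor_slices(x).batch(8)
    model.predict(ds)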
So you can't call `predict()` on a batch? Asking for a friend.
It seems that predict has no issues working on batched tensors (colab)
Definitely a good idea. There's no way we can support this much complexity for each class. I opened #868 for it a bit ago. My plan was to keep this as a follow-up.
Approved!
Dropped some non-blocking comments; let's keep discussing and do roll-forward fixes.
-    prompt,
-    max_length,
+    inputs,
+    max_length=None,
Sorry for the back and forth, but I am starting to feel that giving `max_length` a default of the model capacity could lead to a bad UX. The vanilla GPT-2 won't stop until 1024 tokens, so users would need to wait a while to get results back. But anyway, let's get this PR in, and I can open a few follow-ups.
Let's chat! I think we should think a little more generally and long term. How many people actually want exactly 50 unconditioned tokens from GPT-2? It's kind of a toy use case I think.

Most real workflows will involve fine-tuning on sequences that terminate in some way, e.g. building a summarizer, a chatbot responder, or a translator. This gets even more true when we think about seq2seq models like T5 and BART, which will follow the same UX we are establishing here.

We want to choose a default that will scale well towards the future states of the library, and I am skeptical that requiring a `max_length` in all cases is a good idea.
sample_weight: Any label weight data. Will be passed through unaltered.
sequence_length: Pass to override the configured `sequence_length` of
    the layer.
add_start_token: Pass to override the configured value of
(Cannot remember if I have posted this or not...) So in both `call()` and `__init__()` we have `add_start_token`, but they are actually different types. This could lead to some confusion.
There was a comment chain above that confused me, but basically, if we go with this override approach, we need three possible values in `call()` so it can act as an actual override: `None` defaults to the configured value, `True` overrides to true, and `False` overrides to false. If you only accept `True`/`False` here, you have made the init argument obsolete, right?
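A tiny sketch of that three-state override (illustrative, not the PR's exact code):

    class PreprocessorSketch:
        def __init__(self, add_start_token=True, add_end_token=True):
            self.add_start_token = add_start_token
            self.add_end_token = add_end_token

        def call(self, x, add_start_token=None, add_end_token=None):
            # None -> use the configured default; True/False -> per-call override.
            if add_start_token is None:
                add_start_token = self.add_start_token
            if add_end_token is None:
                add_end_token = self.add_end_token
            return x, add_start_token, add_end_token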
(Force-pushed from e69b262 to 7137316.)
Seems reasonable, just a few questions.
    is attached to the model, inputs should instead be a nested
    `tf.Tensor` or `tf.data.Dataset` with keys `"token_ids"` and
    `"padding_mask"`.
max_length: Optional. int. The max length of the generated sequence.
But what if there's no preprocessor? In that case this arg appears not to work.
Yeah, this arg is worth discussing. `max_length` is really conceptually a preprocessing arg and does absolutely nothing if we are passing preprocessed inputs, which should already have shape `(batch_size, max_length)`.

This feels somewhat like what you would call a denormalized argument, in that `sequence_length` on the preprocessor and `max_length` here are really setting the same thing.

We could...

- Think about removing it, though I would do that as a follow-up perhaps.
- Document that it does nothing when `preprocessor` is `None`.
- Throw an error if `inputs.shape[1] != max_length` (see the sketch below).

Thoughts?
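For reference, the error option would be something like this sketch (not what was ultimately chosen; the documenting approach below was):

    if self.preprocessor is None and max_length is not None:
        input_length = inputs["token_ids"].shape[1]
        if input_length != max_length:
            raise ValueError(
                f"`max_length` is {max_length}, but preprocessed `inputs` have "
                f"length {input_length}. Pad inputs to `max_length`, or leave "
                f"`max_length` unset."
            )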
I think I will go with the lightweight approach of documenting for now.
Slight edit for clarity:
"""
If `preprocessor` is `None`, `inputs` should be padded to the desired maximum
length and this argument will be ignored.
"""
Thanks for the hard work here!
This updates `generate()` to feel more like `fit()`/`predict()`/`evaluate()`. Inputs to generate can be a dataset or raw tensors. Inputs can be preprocessed or not, depending on whether the model has a preprocessor layer attached. The preprocessing layer is used to preprocess all inputs before generation.
Fixes #911, #912, #913 and #844 (if we go with this approach).
Colab with usage
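A rough sketch of the usage described above (the preset name and prompts are assumptions for illustration; see the colab for the exact code):

    import tensorflow as tf
    import keras_nlp

    causal_lm = keras_nlp.models.GPT2CausalLM.from_preset("gpt2_base_en")

    # Raw strings run through the attached preprocessor, just like fit()/predict().
    causal_lm.generate("The quick brown fox", max_length=64)

    # A tf.data.Dataset of strings also works, batched like any Keras input pipeline.
    ds = tf.data.Dataset.from_tensor_slices(
        ["That quick brown fox.", "A lazy dog."]
    ).batch(2)
    causal_lm.generate(ds, max_length=64)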