[DO NOT MERGE] Experimental implementation of CausalLM with a Keras Functional backbone_with_cache #1598

Draft
martin-gorner wants to merge 4 commits into base: master

Conversation

martin-gorner (Author)

This is a proof-of-concept PR for the new layer graph cloning API in Keras (keras-team/keras#19600).
It is not meant to be merged as such, but to provide a tangible use case for the design of the new layer graph cloning API.

The problem to solve was:

  • XXXCausalLM(backbone) accepted a Functional backbone but did not use the backbone's graph of layers.
  • Instead, XXXCausalLM implemented its call_with_cache by calling selected layers from the backbone directly.
  • If the user provided a backbone with extra layers between TransformerDecoder blocks (or any other modified layer graph), the backbone's graph would be disregarded in call_with_cache.
  • Several LLM techniques rely on a modified backbone (e.g. control vectors, https://arxiv.org/abs/2310.01405) and could therefore not be used in KerasNLP.

To let users implement what they want in the backbone and still have call_with_cache work in XXXCausalLM, the caching must be added to the backbone in a Keras Functional way, respecting the backbone's Functional layer graph.
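
For context, here is a hedged sketch of the status-quo pattern described above, loosely modeled on the existing GPT2CausalLM.call_with_cache (layer attribute names and call signatures vary per model and are assumptions here; ops refers to keras.ops):

# Sketch only: the CausalLM task re-implements the forward pass by calling
# selected backbone layers by hand, so any extra layers in the backbone's
# graph are silently skipped during cached generation.
def call_with_cache(self, token_ids, cache, cache_update_index):
    x = self.backbone.token_embedding(token_ids)
    # ... position embedding, dropout, etc. are also re-invoked by hand ...
    caches = []
    for i, transformer_layer in enumerate(self.backbone.transformer_layers):
        current_cache = cache[:, i, ...]
        x, next_cache = transformer_layer(
            x,
            self_attention_cache=current_cache,
            self_attention_cache_update_index=cache_update_index,
        )
        caches.append(next_cache)
    cache = ops.stack(caches, axis=1)
    hidden_states = self.backbone.layer_norm(x)
    logits = self.backbone.token_embedding(hidden_states, reverse=True)
    return logits, hidden_states, cache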

The new layer graph cloning API can be used:

  • externally, to provide an easy UX for light-weight edits of pre-trained models, for example to implement control vectors.
  • internally, to wire caches into the backbone in order to implement call_with_cache in a Keras Functional way.

This PR implements a Keras Functional call_with_cache for GPT2 and Gemma.

github-actions bot added the Gemma (Gemma model specific issues) label on Apr 25, 2024
@martin-gorner (Author)

For a demo of the functionality, see this Colab: Model rewiring demo with LLMs.ipynb.

For example, you can insert control vectors into an LLM backbone with this clone_fn applied to the backbone:

def clone_fn(layer, *args, **kwargs):
  if isinstance(layer, keras_nlp.layers.TransformerDecoder):
    x = layer(*args, **kwargs)
    x = ControlVectorLayer()(x)  # insert a control vector after each decoder block
    return x
  else:
    return layer(*args, **kwargs)  # all other layers are called unchanged
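
ControlVectorLayer is not defined in this PR. A minimal sketch of what such a layer might look like, assuming the control vector is simply a learned offset added to the hidden states (purely illustrative, not the demo's actual code):

import keras

class ControlVectorLayer(keras.layers.Layer):
    """Adds a steering vector to the hidden states of a decoder block."""

    def build(self, input_shape):
        # One value per hidden dimension, broadcast over batch and sequence.
        self.control_vector = self.add_weight(
            name="control_vector",
            shape=(input_shape[-1],),
            initializer="zeros",
            trainable=True,
        )

    def call(self, x):
        return x + self.control_vector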

Before/after visualization of the backbone:
[image: before/after plot of the backbone layer graph]

@martin-gorner (Author)

And here is what a re-wired backbone with caches looks like. Since it is now a proper Keras Functional model, it can be plotted. The layout is not the best but you can see the cache input fed into all layers and an updated cache fed out and collected at the end.
[image: plot of the rewired backbone showing cache inputs and outputs]

@martin-gorner (Author)

Known issue: the max_length parameter in generate(prompt, max_length=64) does not work.

@martin-gorner (Author)

I have changed the implementation to use the new new_model = clone_model(model, clone_function=lambda x: x, call_function=...) API instead of the previously suggested output = clone_layer_graph(input, output, clone_fn=...).

For this use case, i.e. rewiring a language model backbone with KV caches, the new API is a bit awkward, as it forces the user to use an intermediate model. In simplified code:

rewired_backbone = clone_model(backbone,
                               clone_function=lambda x:x, # no cloning
                               call_function=rewire_fn)

# Build a new backbone with caches in inputs and outputs.
input = {
    "token_ids": rewired_backbone.input["token_ids"],
    "cache": cache_input, # new input
    "cache_update_index": cache_update_index_input, # new input
}

# During the rewiring process, next_caches were collected, add them as a new output
next_cache = ops.stack(next_caches, axis=1)
output = (rewired_backbone.output, next_cache)

# create a new backbone that now uses caches in its forward pass
real_rewired_backbone = keras.Model(input, output, name=backbone.name + "_with_cache")
return real_rewired_backbone

The intermediate model rewired_backbone is "wrong": it still has the original inputs of backbone, i.e. token_ids and padding_mask, while its layer graph no longer uses padding_mask and now uses the additional inputs cache_input and cache_update_index. The user has to create a new model, real_rewired_backbone, to fix those issues. It is also surprising that these graph connectedness issues were not caught when rewired_backbone was constructed; this code might fail in the future if graph connectedness checks are improved.
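
The rewire_fn passed to call_function above is not shown; a hedged sketch of its general shape (cache_input, cache_update_index_input and next_caches are the variables from the snippet above; the backbone.transformer_layers attribute and the per-block cache call signature are assumptions that vary per model):

next_caches = []

def rewire_fn(layer, *args, **kwargs):
    # Decoder blocks are re-called with their slice of the cache, and the
    # updated cache is collected so it can become a new model output.
    if layer in backbone.transformer_layers:
        layer_index = backbone.transformer_layers.index(layer)
        x, next_cache = layer(
            args[0],
            cache=cache_input[:, layer_index, ...],
            cache_update_index=cache_update_index_input,
        )
        next_caches.append(next_cache)
        return x
    return layer(*args, **kwargs)  # every other layer is called unchanged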

The previously suggested API did not have this awkwardness as it did not involve an intermediate Model. In simplified code:

# Build a new backbone with caches in inputs and outputs.
input = {
    "token_ids": backbone.input["token_ids"],
    "cache": cache_input, # new input
    "cache_update_index": cache_update_index_input, # new input
}

# This call can check for graph connectedness without failing
new_output = clone_layer_graph(input, backbone.output, clone_fn=rewire_fn)

# During the rewiring process, next_caches were collected, add them as a new output
next_cache = ops.stack(next_caches, axis=1)
output = (new_output, next_cache)

# create a new backbone that now uses caches in its forward pass
rewired_backbone = keras.Model(input, output, name=backbone.name + "_with_cache")
return rewired_backbone

Additional note: I also noticed that the new API clones input tensors. For example, rewired_backbone.input["token_ids"] and backbone.input["token_ids"] are different tensors. The previously suggested clone_layer_graph API kept input layers identical, as they do not need to be cloned. The new behavior might be surprising for users wondering why the token_ids input changed during the rewiring process and whether it is a bug.

mattdangerw self-requested a review on April 30, 2024.
@mattdangerw (Member) left a comment:

Thanks! Took a look!

Stopgap for now

Below is a colab that would work today and is decently readable (it patches the call method of each transformer layer):

https://colab.research.google.com/gist/mattdangerw/72df1cbe743208bd69137be4f2142203/patched-call-control-vectors.ipynb

This is maybe not the most conceptually beautiful approach, but it's short, practical, and works for modifying attention and key/query/value projections. It's unclear to me whether the approach in this PR could grow to include those types of surgery without substantive rewrites and breaking all weight compatibility.

I think, for better or for worse, "layer patching" is the best cohesive approach we have for surgery today. It seems prudent to point users in this direction for now, as we think through new forms of model surgery and work on a more flexible form of our current generation code.
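
A hedged sketch of the layer-patching approach described above (not the exact code from the linked colab; it assumes a decoder block returns either hidden states or a (hidden_states, cache) tuple):

def add_control_vector(decoder_block, control_vector):
    """Wrap a decoder block's call method to add a control vector to its output."""
    original_call = decoder_block.call

    def patched_call(*args, **kwargs):
        outputs = original_call(*args, **kwargs)
        if isinstance(outputs, tuple):  # (hidden_states, cache) during cached generation
            return (outputs[0] + control_vector,) + outputs[1:]
        return outputs + control_vector

    # Shadow the bound method on this instance only; the layer graph is untouched.
    decoder_block.call = patched_call

# Hypothetical usage, assuming the backbone exposes its decoder blocks as a list:
# for block in causal_lm.backbone.transformer_layers:
#     add_control_vector(block, control_vector)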

Forward-looking thoughts

What is on this PR is not general for all models (see comments below), and the general form might require pushing the rewiring into individual model implementations to handle different layer types, input types, arg names, and overall structures (e.g. seq2seq).

I think it'd be doable in the technical sense, if a good bit more code. The rewiring code is a bit clunky and not super readable.

I'd be interested in scoping out support for optional inputs in Keras functional models. If we did this, we could write a backbone that supports caching out of the box, as well as other common "power user" inputs (e.g. attention_mask, token_positions); that would cover a lot of other important CUJs without any need for cloning or surgery.

Optional inputs could allow a rewrite of our generative task code to treat the backbone as a functional black box, without any assumptions about its internal layer structure. That could enable a lot of the types of functional surgery you are interested in, with a much smaller blast radius on our model implementation code.

@@ -503,3 +503,48 @@ def get_config(self):

    def compute_output_shape(self, decoder_sequence_shape):
        return decoder_sequence_shape

    def compute_output_spec(
Member:

This is kinda clunky, but might be a good idea to add regardless. Can we just **kwargs the args we don't actually care about here?

I'm not sure if we need compute_output_shape if we do this.

Author:

Yes, compute_output_shape is probably not required when compute_output_spec is implemented. Here, compute_output_spec was necessary because the layer returns differently shaped outputs depending on its inputs.
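
For illustration, a hedged sketch of what such a compute_output_spec could look like on TransformerDecoder (not the exact code from the diff; it assumes import keras and the existing self_attention_cache argument name):

def compute_output_spec(
    self, decoder_sequence, self_attention_cache=None, **kwargs
):
    # With a cache the layer returns (hidden_states, cache); without one it
    # returns only the hidden states, so the output structure depends on inputs.
    output_spec = keras.KerasTensor(
        decoder_sequence.shape, dtype=self.compute_dtype
    )
    if self_attention_cache is not None:
        cache_spec = keras.KerasTensor(
            self_attention_cache.shape, dtype=self.compute_dtype
        )
        return output_spec, cache_spec
    return output_spec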

    return output

def rewire_fn(layer, *args, **kwargs):
    if isinstance(layer, PositionEmbedding):
Member:

This would not work for a few models where the position embedding is part of a composite layer (TokenAndPositionEmbedding).

def rewire_fn(layer, *args, **kwargs):
    if isinstance(layer, PositionEmbedding):
        return rewire_positionembedding(layer, *args, **kwargs)
    elif isinstance(layer, TransformerDecoder):
Member:

This would not work for most decoder models (as most decoder models write their own decoder block).

Author:

We can test for membership in the set of backbone.transformer_blocks rather than for a specific type. This can be solved with a convention for what a "backbone" should contain (which makes sense: not just any backbone works for cached text generation).
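
A hedged sketch of that convention-based check (the transformer_blocks attribute is the convention being proposed here, not an existing API, and rewire_decoder_block is a hypothetical helper):

def rewire_fn(layer, *args, **kwargs):
    # Rewire any layer the backbone declares as a decoder block, regardless of its type.
    if layer in backbone.transformer_blocks:
        return rewire_decoder_block(layer, *args, **kwargs)
    return layer(*args, **kwargs)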

def _rewire_backbone_with_cache(backbone, cache_shape, cache_dtype):

    # Define new inputs for caches.
    cache_update_index_input = keras.Input(
Member:

Our cache might change to be a tuple of individual layer caches to avoid stacking/concatenating, as described in #1562.

And further down the road, we might want to add more cache shape options, e.g. for things like token attention.

Interestingly, a cache of tuples would invalidate our current restriction on functional model inputs. We'd want a nested structure where one dictionary key contains a tuple of inputs, which would break here: https://github.com/keras-team/keras/blob/9f4da5159a098256dfbccd2c926107953a6812e5/keras/src/models/functional.py#L134-L141

So we may need to do more thinking here if we "unstack our cache".

Author:

Expanding functional models to arbitrary pytree inputs and outputs (as long as the leaves are KerasTensors) is on the roadmap (look under "Modeling").

)
# cache_update_index_input is always a scalar. We must force the
# shape to scalar because keras.Input assumes a batch dim.
cache_update_index_input.shape = ()
Member:

This seems kinda hacky. Is this something we want to support generally in Keras, i.e. unbatched functional model inputs? And if so, is this the way we would like to do it?

# === Backbone with cache ===
# The backbone with a cache is used in call_with_cache
cache_shape = self._compute_cache_shape(backbone, preprocessor)
self.backbone_with_cache = self._rewire_backbone_with_cache(
Member:

This might fail with Keras 2 saving for obscure reasons. Basically, we might try to save backbone_with_cache before the internal model layers, invalidating the whole checkpoint structure. (Just yet another reason to try to ditch Keras 2 asap).

Author:

+1

{
    "token_ids": token_ids,
    "cache": cache,
    "cache_update_index": ops.convert_to_tensor(cache_update_index),
Member:

Why do we need convert_to_tensor here?

Author:

Not sure why. It would not work without it.

@mattdangerw (Member)

Update here... @fchollet has added support for optional functional inputs.

So what I think we can do is write a backbone that allows two optional inputs, cache and index. Then we can write a causal LM that needs zero knowledge of the internals of the backbone, just its inputs and outputs during generation. So the entire "rewire" code can go away.

I think this is the right solution abstraction-wise, and it will allow a lot more aggressive model surgeries.

But landing this will still take some effort, as we will need to drop Keras 2 code paths in the library (Keras 2 will not support optional inputs).
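
For illustration, a hedged sketch of what the optional-inputs direction could look like (Keras 3 functional inputs accept optional=True; all names and shapes below are illustrative assumptions, not this PR's code):

import keras

# Illustrative cache geometry (num_layers, 2 for key/value, num_heads, head_dim).
num_layers, num_heads, head_dim = 12, 12, 64

token_ids = keras.Input(shape=(None,), dtype="int32", name="token_ids")
padding_mask = keras.Input(shape=(None,), dtype="int32", name="padding_mask")

# Optional inputs: None is passed when calling the backbone without a cache.
cache = keras.Input(
    shape=(num_layers, 2, None, num_heads, head_dim),
    name="cache",
    optional=True,
)
cache_update_index = keras.Input(
    shape=(), dtype="int32", name="cache_update_index", optional=True
)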
