Always run the rotary embedding layer in float32 #1508
Conversation
Code looks good!
We should probably test this to make sure the numerics are as close to our reference JAX implementation as they were before, and that this does not negatively impact performance.
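A minimal sketch of such a numerics check, comparing the layer under a low-precision dtype against a float32 reference (the input shape and tolerances here are illustrative assumptions, not values from this PR):

```python
import numpy as np
import keras
from keras_nlp.layers import RotaryEmbedding

# Illustrative input: (batch, seq_len, num_heads, head_dim).
x = np.random.uniform(size=(2, 16, 8, 64)).astype("float32")

ref_layer = RotaryEmbedding(max_wavelength=10_000.0, dtype="float32")
test_layer = RotaryEmbedding(max_wavelength=10_000.0, dtype="bfloat16")

ref_out = keras.ops.convert_to_numpy(ref_layer(x))
# Cast up before converting, since numpy has no native bfloat16.
test_out = keras.ops.convert_to_numpy(
    keras.ops.cast(test_layer(x), "float32")
)

# Tolerance is a guess; tighten it to whatever the reference allowed before.
np.testing.assert_allclose(ref_out, test_out, atol=1e-2, rtol=1e-2)
```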
@@ -87,28 +88,20 @@ def build(self, inputs_shape):
            (None, None, self.num_query_heads, self.head_dim)
        )
        self.softmax = keras.layers.Softmax(dtype="float32")

        self.rope_layer = RotaryEmbedding(
            max_wavelength=10000.0, dtype=self.dtype_policy
Nit, but let's start using the new 10_000.0 style for large numbers; it's quite readable.
Sounds good, I love _ to separate large numbers!
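For reference, Python treats underscores in numeric literals purely as visual separators (PEP 515), so the two spellings are identical:

```python
# Underscores in numeric literals are ignored by the parser;
# they only aid readability.
assert 10_000.0 == 10000.0
assert 1_000_000 == 1000000
```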
        freq_exponents = (2.0 / x_shape[-1]) * ops.arange(
            x_shape[-1] // 2, dtype="float32"
        x = self.rope_layer(x, start_index=start_index)
        x = ops.reshape(
Can we drop a comment explaining this?
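A sketch of what that explanatory comment could look like, assuming the reshape converts the rotary output from a half-split layout to the interleaved layout Gemma's original implementation used (the split/stack body is an illustration, not necessarily the PR's exact code):

```python
x = self.rope_layer(x, start_index=start_index)
# Assumed purpose: RotaryEmbedding rotates the two halves of the feature
# axis, while the original Gemma implementation interleaves even/odd
# features. Splitting the halves and stacking them pairwise converts the
# half-split layout back to the interleaved one, keeping outputs
# numerically equivalent to the reference implementation.
x = ops.reshape(
    ops.stack(ops.split(x, 2, axis=-1), axis=-1),
    ops.shape(x),
)
```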
        tensor = ops.cast(tensor, dtype=inverse_freq.dtype)
        freq = ops.einsum("i,j->ij", tensor, inverse_freq)
        embedding = ops.concatenate((freq, freq), axis=-1)
There is a weird bug with concatenate on the jax runtime. See the note below in the code you are removing.
# Avoid `ops.concatenate` for now, to avoid an obscure bug with XLA
# compilation on jax. We should be able to remove this once the
# following PR is in all jax releases we care about:
# https://github.com/openxla/xla/pull/7875
output = ops.stack([x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1)
return ops.reshape(output, x_shape)
This may be fixed in recent versions of jax, but we should probably check that this is not an issue with the current version of jax on Colab and Kaggle with a GPU runtime. We might need to keep this fix.
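For context, the general stack-plus-reshape trick can reproduce a last-axis concatenate exactly, which is why this kind of workaround can be a drop-in replacement; a minimal NumPy sketch of the equivalence (shapes are illustrative):

```python
import numpy as np

x = np.random.uniform(size=(2, 8)).astype("float32")
x1, x2 = np.split(x, 2, axis=-1)  # two (2, 4) halves

# rotate_half via concatenate: [-x2, x1] along the last axis.
via_concat = np.concatenate((-x2, x1), axis=-1)

# The same result via stack + reshape: stacking on a new second-to-last
# axis and flattening lays the -x2 block out before the x1 block.
via_stack = np.stack((-x2, x1), axis=-2).reshape(x.shape)

np.testing.assert_array_equal(via_concat, via_stack)
```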
Checked inference; it works with the JAX version on Colab with an A100.
RotaryEmbedding has always used concatenate, and it worked even before the fix was shipped in JAX (e.g. Mistral worked). I am not sure if this bug is an obscure TPU thing. Might be good to confirm with the original author.
Yeah, the concatenate issue was funky. It should only occur with jax 0.4.23, not 0.4.24. And I believe it was only showing up at certain shapes (e.g. batch size 2 but not batch size 1), and only during fine-tuning.
To be safe, maybe let's propagate the fix for now and ditch it after 0.4.24 is on Colab.
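If the workaround were gated on the JAX version instead, a minimal sketch could look like this (hypothetical flag, not code from this PR):

```python
# Hypothetical gate: keep the stack-based workaround only on jax 0.4.23,
# where the XLA concatenate bug is believed to show up.
try:
    import jax

    _USE_STACK_WORKAROUND = jax.__version__.startswith("0.4.23")
except ImportError:
    _USE_STACK_WORKAROUND = False
```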
Already done here: https://colab.research.google.com/drive/1BNNlxN7Y7yAzJl0UeWdG9TZ6RpfJjCBS?usp=sharing
lgtm! thank you!
* Always run the rotary embedding layer in float32
* Fix the int32 issue with TensorFlow
* Only run sin/cos embedding compute step in float32
* Avoid start_index from downcasting automatically
* Use stack instead of concatenate
Follow-up for #1497
This PR refactors the keras_nlp.layers.modelling.rotary_embedding.RotaryEmbedding layer to always compute in float32 dtype, since there are significant precision losses in other dtypes. It also updates Gemma to use this layer instead of implementing its own version of RoPE.

This PR isn't ready yet. TODO:

* … bfloat16.
* Add tests for the RotaryEmbedding layer to check no precision is lost with float16 and bfloat16 dtypes.

Colab showing the equivalence of Gemma's embedding and the rotary embedding in KerasNLP: https://colab.research.google.com/drive/1BNNlxN7Y7yAzJl0UeWdG9TZ6RpfJjCBS?usp=sharing
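A minimal sketch of the compute-in-float32 pattern the PR describes: cast inputs up, do the sin/cos work in float32, and cast the result back to the incoming compute dtype. The function name and shapes are illustrative assumptions, not the PR's exact code:

```python
import keras
from keras import ops


def apply_rope_float32(x, positions, max_wavelength=10_000.0):
    """Illustrative RoPE application doing the trig math in float32.

    `x` is assumed to be (batch, seq_len, num_heads, head_dim) and
    `positions` a (seq_len,) tensor of token positions.
    """
    # Remember the incoming dtype so we can cast back at the end.
    compute_dtype = keras.backend.standardize_dtype(x.dtype)
    x = ops.cast(x, "float32")
    positions = ops.cast(positions, "float32")

    head_dim = ops.shape(x)[-1]
    freq_exponents = (2.0 / head_dim) * ops.arange(
        head_dim // 2, dtype="float32"
    )
    timescale = max_wavelength**freq_exponents
    # Shape (seq_len, 1, head_dim // 2), broadcasting over the heads axis.
    radians = (positions[..., None] / timescale[None, :])[..., None, :]
    sin, cos = ops.sin(radians), ops.cos(radians)

    x1, x2 = ops.split(x, 2, axis=-1)
    # Stack + reshape instead of concatenate (see the XLA note above).
    output = ops.stack([x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1)
    output = ops.reshape(output, ops.shape(x))
    # Cast back so downstream layers see the original compute dtype.
    return ops.cast(output, compute_dtype)
```

Under a `mixed_bfloat16` policy, q and k would arrive as bfloat16; the cast up and back keeps the trig math exact while leaving the rest of attention untouched.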