
Keep rope at float32 precision #1497

Merged · 5 commits · Mar 13, 2024
Conversation

@grasskin (Member) commented Mar 7, 2024

No description provided.

@tirthasheshpatel (Contributor) left a comment

I think we also need to cast back to the compute_dtype before returning. Only the computation part needs to happen in float32.
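For illustration, a minimal sketch of that pattern, assuming Keras 3's keras.ops and a stand-in computation rather than the actual Gemma rotation:

    from keras import ops

    def computed_in_float32(x, compute_dtype="bfloat16"):
        # Upcast, do the precision-sensitive math in float32, then cast back
        # before returning so downstream layers still see compute_dtype.
        x_f32 = ops.cast(x, "float32")
        y = ops.sin(x_f32)  # stand-in for the RoPE math
        return ops.cast(y, compute_dtype)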

@@ -95,7 +95,7 @@ def _apply_rope(self, x, positions):
         max_wavelength = 10000
         x_shape = ops.shape(x)
         freq_exponents = (2.0 / x_shape[-1]) * ops.cast(
-            ops.arange(x_shape[-1] // 2, dtype="float32"), self.compute_dtype
+            ops.arange(x_shape[-1] // 2, dtype="float32"), "float32"
A Contributor commented on the diff:

We can remove the ops.cast call here; everything should be float32.

@tirthasheshpatel (Contributor) commented Mar 7, 2024

@mattdangerw Gemma still downcasts the tensors to compute_dtype since it uses its own implementation of RoPE. I can submit a follow-up PR to use this layer instead.
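For context, the shared layer referred to here is keras_nlp.layers.RotaryEmbedding (see the TODO in the diff below); a hypothetical usage sketch, assuming the layer's default arguments rather than the follow-up PR's actual code:

    import keras_nlp
    from keras import ops

    rope = keras_nlp.layers.RotaryEmbedding(max_wavelength=10000)
    x = ops.ones((1, 8, 64))  # (batch, sequence, feature)
    x_rotated = rope(x)       # rotary embedding applied along the feature axis per position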

@mattdangerw (Member) commented Mar 8, 2024

> I think we also need to cast back to the compute_dtype before returning. Only the computation part needs to happen in float32.

Yeah looks like this is causing test failures, probably due to this issue?

@danielhanchen commented Mar 9, 2024

Hi :) I'm assuming this came about from my Twitter thread https://twitter.com/danielhanchen/status/1765446273661075609 :)

I added a fix into transformers 4.38.2 here: huggingface/transformers#29285. Using mixed_bfloat16 causes torch.autocast to cast all ops to bfloat16, and I know that even explicitly forcing float32 can be overridden by autocast. I don't normally use Keras, though, so I'm unsure whether torch.autocast affects the operations here.
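For reference, a small PyTorch snippet (not from this thread) showing the autocast behavior being described; whether the Keras torch backend hits this path is exactly the open question:

    import torch

    a = torch.randn(4, 4, dtype=torch.float32)
    b = torch.randn(4, 4, dtype=torch.float32)

    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        out = torch.mm(a, b)
        print(out.dtype)           # torch.bfloat16 -- matmul is run at autocast's lower precision
        print(torch.sin(a).dtype)  # torch.float32 -- ops off autocast's cast lists keep their input dtype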

Also another problematic line is https://github.com/keras-team/keras-nlp/blob/v0.8.2/keras_nlp/models/gemma/gemma_attention.py#L159

        seq_len = ops.shape(x)[1]
        start_index = cache_update_index
        positions = ops.cast(
            ops.arange(seq_len, dtype="float32"), >>>>> self.compute_dtype <<<<
        )
        positions = positions + ops.cast(start_index, self.compute_dtype)

Which is wrong: if someone did RoPE scaling with float16, the maximum representable value would be 65504, which in turn causes overflow, i.e. infinities. bfloat16 loses precision but can represent larger numbers.
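A quick illustration of that range limit (a standalone snippet, not the Gemma code):

    from keras import ops

    positions = ops.arange(70_000, dtype="float32")
    print(ops.max(ops.cast(positions, "float16")))   # inf -- float16 overflows past 65504
    print(ops.max(ops.cast(positions, "bfloat16")))  # finite (~7e4), just coarsely rounded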

@grasskin (Member, Author) commented

Hi @danielhanchen, enjoyed reading the blog post, it was a great in-depth dive!

Switched all of RoPE to happen in "float32" and added downcasting before returning. This likely works until we replace the call with normal Keras RoPE?

@tirthasheshpatel (Contributor) commented

> Switched all of RoPE to happen in "float32" and added downcasting before returning. This likely works until we replace the call with normal Keras RoPE?

Yeah, this should be good enough for now. We can merge this and I can rebase my PR on top of your changes.

@@ -92,10 +92,13 @@ def build(self, inputs_shape):
     def _apply_rope(self, x, positions):
         """Rope rotate q or k."""
         # TODO: refactor to use RotaryEmbedding layer?
+        x = ops.cast(
A Member commented on the diff:

do we really want x in float32? that feels a little awkward, given that we are going to downcast immediately. seems reasonable to keep the sin and cos line at full precision (though i have no idea if that's important or not).

but this line seems like it should probably be done in compute_dtype => ops.stack([x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1)

so

sin = ops.cast(ops.sin(radians), self.compute_dtype)
cos = ops.cast(ops.cos(radians), self.compute_dtype)
x1, x2 = ops.split(x, 2, axis=-1)
return ops.stack([x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1)

incidentally, if we ever did cache the rotary vectors, that would mean we are caching them at float32 and applying them at the correct dtype for the model

wdyt?
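Pulling that suggestion together as a standalone sketch: the split-halves/concatenate form is used here to keep shapes simple, and the angle computation is reconstructed rather than copied from the PR, so treat names and shapes as assumptions.

    import keras
    from keras import ops

    def apply_rope_float32(x, positions, compute_dtype, max_wavelength=10000):
        # Build the rotary angles in float32, downcast sin/cos, apply at compute_dtype.
        head_dim = ops.shape(x)[-1]
        freq_exponents = (2.0 / head_dim) * ops.arange(head_dim // 2, dtype="float32")
        timescale = max_wavelength**freq_exponents                        # float32
        radians = ops.cast(positions, "float32")[..., None] / timescale   # float32 angles
        sin = ops.cast(ops.sin(radians), compute_dtype)                   # downcast only here
        cos = ops.cast(ops.cos(radians), compute_dtype)
        x1, x2 = ops.split(x, 2, axis=-1)                                 # x stays at compute_dtype
        return ops.concatenate([x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1)

    # e.g.
    x = keras.random.normal((2, 8, 64), dtype="bfloat16")
    positions = ops.arange(8, dtype="int32")[None, :]
    out = apply_rope_float32(x, positions, "bfloat16")  # same shape and dtype as x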

@grasskin (Member Author) replied:

I agree, casting x is a bit awkward. I guess as long as radians/positions/timescale are in float32 initially, we avoid the numerical inaccuracy.

Dug more into the other submitted fix for sin/cos in float32, huggingface/transformers#29285 (comment), and it seems like this is the right call here (downcasting before we stack/multiply).

A Member replied:

Sweet! This LGTM! Will pull it in.

@mattdangerw merged commit 09d2fdd into keras-team:master on Mar 13, 2024
10 checks passed
@grasskin mentioned this pull request on Mar 21, 2024
abuelnasr0 pushed a commit to abuelnasr0/keras-nlp that referenced this pull request Apr 2, 2024
* Keep rope at float32 precision

* Carry out all of RoPE in float32

* Formatting

* Cleanup

* Do not cast x