Run the LLaMA and Mistral RMS Layer Norm in float32 #1532
Conversation
Looks good! Style nits.
```diff
 def build(self, input_shape):
+    self._dim = input_shape[-1]
     self.weight = self.add_weight(
```
I don't think `self._dim` needs to be persisted anywhere, right? Why not just `dim = input_shape[-1]`?
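E.g., a minimal sketch of what that would look like (the exact weight arguments here are assumptions, not the PR's diff):

```python
def build(self, input_shape):
    # The feature dim is only needed while creating the weight,
    # so a local variable is enough; nothing has to live on `self`.
    dim = input_shape[-1]
    self.weight = self.add_weight(
        name="weight",
        shape=(dim,),
        initializer="ones",
        trainable=True,
    )
    self.built = True
```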
```diff
 def __init__(self, epsilon=1e-6, **kwargs):
     super().__init__(**kwargs)
-    self.epsilon = epsilon
+    self._epsilon = epsilon
```
Try to keep these init args as public attrs. Update mistral instead.
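I.e., something like (sketch):

```python
def __init__(self, epsilon=1e-6, **kwargs):
    super().__init__(**kwargs)
    # Public attribute, mirroring the constructor argument.
    self.epsilon = epsilon
```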
```python
def call(self, x):
    x = ops.cast(x, "float32")
    x = x * ops.rsqrt(
        ops.mean(ops.power(x, 2), axis=-1, keepdims=True) + self._epsilon
    )
```
nit: use intermediate variables to keep things on one line and readable, e.g.

```python
var = ops.mean(ops.power(x, 2), axis=-1, keepdims=True)
x = x * ops.rsqrt(var + self.epsilon)
```
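Putting the cast and that refactor together, the whole `call` might read roughly like this (a sketch; the cast back to the input dtype is an assumption about how the rest of the layer behaves, not something shown in the diff):

```python
def call(self, x):
    input_dtype = x.dtype
    # Compute the RMS statistics in float32 for numerical stability.
    x = ops.cast(x, "float32")
    var = ops.mean(ops.power(x, 2), axis=-1, keepdims=True)
    x = x * ops.rsqrt(var + self.epsilon)
    # Cast back so downstream layers see the dtype they expect.
    return ops.cast(x, input_dtype) * self.weight
```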
```diff
 def build(self, input_shape):
+    self._dim = input_shape[-1]
-    self.weight = self.add_weight(
+    self._weight = self.add_weight(
```
Keep this public too. Update mistral instead.
Do our checkpoints still load fine if we call this `scale`? And name it `scale`? That's what Gemma does, and it's a better name.
> Do our checkpoints still load fine if we call this `scale`? And name it `scale`?

I think Keras should work when the variable name in Python changes. AFAIK, Keras loads weights using the `name` field of the variable, so changing that would break loading.
I believe it is actually order-based. Did you try it out? https://github.com/keras-team/keras/blob/97b082dfee2552fcad1a7c7ea0fac9c72943360c/keras/layers/layer.py#L1187-L1188
It's not the end of the world. Just for things like the optimize call here, we actually reference the `scale` name.
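For illustration, an order-based transfer with `get_weights`/`set_weights` ignores variable names entirely. A minimal sketch below; `OldNorm`/`NewNorm` are hypothetical stand-ins for the layer before and after the rename, and this doesn't exercise the saved-checkpoint path:

```python
import numpy as np
from keras import layers

class OldNorm(layers.Layer):
    def build(self, input_shape):
        self.weight = self.add_weight(
            name="weight", shape=(input_shape[-1],), initializer="ones"
        )

    def call(self, x):
        return x * self.weight

class NewNorm(layers.Layer):
    def build(self, input_shape):
        self.scale = self.add_weight(
            name="scale", shape=(input_shape[-1],), initializer="ones"
        )

    def call(self, x):
        return x * self.scale

x = np.ones((2, 4), "float32")
old, new = OldNorm(), NewNorm()
old(x), new(x)  # build both layers
# Positional matching: names don't matter, only order and shape.
new.set_weights(old.get_weights())
np.testing.assert_allclose(old(x), new(x))
```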
Oh cool. Thanks for the references. Will try to change and check if I can still load the preset!
I verified that weights load with the name change, so changes pushed. Let's see if the CI is also happy with it. Thanks!
- Change private variables to public vars
- Change `self._weight` to `self.scale`
- Don't persist the input dim
- Move the var computation to its own line for readability
@mattdangerw Addressed the review comments. Let me know if the diff looks good to you now!
Looks good besides that one potential name change. Thanks!
* Run the LLaMA RMS Layer Norm in float32
* Also use float32 in Mistral Layer Norm
* Address review comments
  - Change private variables to public vars
  - Change `self._weight` to `self.scale`
  - Don't persist the input dim
  - Move the var computation to its own line for readability
* Change weights to scale in layer norm
LLaMA and Mistral Layer Norm should always run in `float32`. This PR corrects this bug in our implementation.
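For reference, a rough sketch of the layer with all of the review feedback folded in (reconstructed from the diff excerpts above; the exact argument names and the final cast are assumptions, not the merged code):

```python
from keras import layers, ops

class LlamaLayerNorm(layers.Layer):
    """RMS layer norm that always computes its statistics in float32."""

    def __init__(self, epsilon=1e-6, **kwargs):
        super().__init__(**kwargs)
        self.epsilon = epsilon  # public, matching the constructor arg

    def build(self, input_shape):
        dim = input_shape[-1]  # local only; not persisted on the instance
        self.scale = self.add_weight(
            name="scale",
            shape=(dim,),
            initializer="ones",
            trainable=True,
        )
        self.built = True

    def call(self, x):
        input_dtype = x.dtype
        # Compute the RMS statistics in float32 regardless of compute dtype.
        x = ops.cast(x, "float32")
        var = ops.mean(ops.power(x, 2), axis=-1, keepdims=True)
        x = x * ops.rsqrt(var + self.epsilon)
        return ops.cast(x, input_dtype) * self.scale
```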