Gemma bug fixes - Approx GELU, Layernorms, Sqrt(hd) #29402

Closed · wants to merge 13 commits

Conversation

@danielhanchen (Contributor) commented on Mar 2, 2024

Just a few more Gemma fixes :) I'm currently checking for more as well! All fixes are derived from Unsloth's 2.5x faster, 70% less VRAM Gemma finetuning script :) https://github.com/unslothai/unsloth

Related PR: #29285, which showed that RoPE must be computed in float32 and not float16, since float16 causes the positional encodings to lose accuracy. @ArthurZucker @younesbelkada

1. Approx GELU and not exact GELU

According to https://twitter.com/danielhanchen/status/1763613620909580505 (waiting for confirmation), the activation function should be approximate GELU and not exact GELU. I.e. hidden_act should actually be gelu_pytorch_tanh, which maps to PytorchGELUTanh and calls nn.functional.gelu(input, approximate="tanh"), whereas gelu calls nn.functional.gelu(input, approximate="none").
[Screenshot: approximate GELU vs exact GELU]
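
To see the difference concretely, here is a minimal check (my own sketch, not code from the PR) comparing the two activations in PyTorch:

import torch
import torch.nn.functional as F

x = torch.linspace(-4.0, 4.0, steps=1001)

exact = F.gelu(x, approximate="none")   # what ACT2FN["gelu"] ends up calling
approx = F.gelu(x, approximate="tanh")  # what ACT2FN["gelu_pytorch_tanh"] ends up calling

# The two curves differ only slightly at any single point, but the mismatch
# compounds across many transformer layers, so the model should use the same
# variant it was trained with.
print((exact - approx).abs().max())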

2. Layernorm (w+1) should be done in float32

The layernorms must be kept in float32 all the way through, not dropped back to bfloat16 or float16 halfway. This is unlike Llama's RMS Layernorm, which downcasts before multiplying by the weights; Gemma must multiply by (weight + 1) in float32 and only downcast at the end.
[Screenshot: RMS Layernorm upcasting fix]
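
As a sketch of the intended order of operations (illustrative only, not the exact HF diff): upcast to float32, normalize, multiply by (1 + weight) while still in float32, and only then cast back.

import torch
from torch import nn

class GemmaStyleRMSNorm(nn.Module):
    # Illustrative sketch of the fix described above, not the actual transformers code.
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.zeros(dim))  # Gemma stores w and applies (1 + w)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        input_dtype = x.dtype
        x = x.float()                                        # upcast to float32
        x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        x = x * (1.0 + self.weight.float())                  # multiply in float32 (unlike Llama)
        return x.to(input_dtype)                             # downcast only at the very end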

3. sqrt(3072)=55.4256 but bfloat16 is 55.5

Interestingly, Gemma multiplies the embeddings by sqrt(hidden_dim). However, there is a precision problem! Gemma uses jnp.sqrt(self.embed_dim).astype(x.dtype), so sqrt(3072) = 55.4256, but casting it to bfloat16 rounds it to 55.5. Likewise for Gemma 2b, sqrt(2048) = 45.2548, but casting it to bfloat16 rounds it to 45.25.
[Screenshot: embedding scaling precision fix]
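
The rounding is easy to reproduce (a quick standalone check, not part of the PR):

import torch

print(3072 ** 0.5)                                      # 55.42562584220407
print(torch.tensor(3072 ** 0.5, dtype=torch.bfloat16))  # tensor(55.5000, dtype=torch.bfloat16)
print(torch.tensor(2048 ** 0.5, dtype=torch.bfloat16))  # tensor(45.2500, dtype=torch.bfloat16)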

@danielhanchen (Contributor Author)

Actually I just noticed this requires updating all of Gemma's models on the HF Model Hub: https://huggingface.co/google/gemma-7b/blob/main/config.json, e.g. gelu should be changed to gelu_pytorch_tanh.

Comment on lines 173 to 181
if hidden_act != "gelu_pytorch_tanh":
    logger.warning_once(
        "Gemma's activation function should be approximate GeLU and not exact GeLU.\n"
        "Please edit your model config to use `gelu_pytorch_tanh` and not `gelu`.\n"
        "For now, we shall use `gelu_pytorch_tanh` temporarily.\n"
        "See https://github.com/huggingface/transformers/pull/29402 for more details."
    )
    hidden_act = "gelu_pytorch_tanh"
self.act_fn = ACT2FN[hidden_act]
Collaborator

I don't mind automatically switching, but it's best if users still have a way to use the legacy gelu! Either add a big warning or use another config name.

Collaborator

So we need a self.hidden_activation set to None by default; if it is None, warn that we will use the new approximate GELU, otherwise use what was given.
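
In other words, roughly (a sketch of the suggestion as stated here, not necessarily the code that was merged):

# Inside GemmaMLP.__init__, with logger and ACT2FN as in modeling_gemma.py (sketch only).
hidden_activation = getattr(config, "hidden_activation", None)
if hidden_activation is None:
    logger.warning_once(
        "config.hidden_activation is not set; defaulting to the approximate GELU "
        "(`gelu_pytorch_tanh`). Set it explicitly to silence this warning."
    )
    hidden_activation = "gelu_pytorch_tanh"
self.act_fn = ACT2FN[hidden_activation]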

Contributor Author

Oh ok, good point! Sorry I didn't work on this in the meantime - I found a few more issues and will push them here tomorrow :)

@danielhanchen changed the title from "Gemma fixes - gelu" to "Gemma bug fixes - Approx GELU, Layernorms, Sqrt(hd)" on Mar 9, 2024
@younesbelkada (Contributor) left a comment

Thanks for working on this @danielhanchen! 🤩
I left one comment about backward compatibility; what do you think?

@@ -170,7 +173,16 @@ def __init__(self, config):
  self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
  self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
  self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
- self.act_fn = ACT2FN[config.hidden_act]
+ hidden_act = config.hidden_act
+ if hidden_act != "gelu_pytorch_tanh":
Contributor

This way, even if there is a model with the old gelu in the config, we're force-setting hidden_act to "gelu_pytorch_tanh", right?
I think we should either use a new config name or create a new attribute in the config, force_use_exact_gelu, initialized to False, so that users have the flexibility to switch back to the old activation function in case they fine-tuned with the old GeLU. What do you think?

Contributor Author

Hmm, I like that approach, i.e. getattr(config, "force_use_exact_gelu", False): if force_use_exact_gelu = True then True, if force_use_exact_gelu = False then False, and False when the attribute is missing.
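
A minimal sketch of that gating (using the force_use_exact_gelu name proposed in this thread; it may not match what was finally merged):

# Inside GemmaMLP.__init__, with ACT2FN as in modeling_gemma.py (sketch only).
force_exact = getattr(config, "force_use_exact_gelu", False)  # False when the attribute is absent
if force_exact:
    self.act_fn = ACT2FN["gelu"]               # legacy exact GELU, for checkpoints fine-tuned with it
else:
    self.act_fn = ACT2FN["gelu_pytorch_tanh"]  # corrected default: approximate (tanh) GELU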

Contributor

Yes! I think we can add that directly into the GemmaConfig class and default it to False.

Contributor Author

I added force_use_exact_gelu!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator) left a comment

LGTM ! Let's fix the CI and merge!

src/transformers/models/gemma/modeling_gemma.py (outdated)
- hidden_states = hidden_states * (self.config.hidden_size**0.5)
+ # Gemma downcasts the below to float16, causing sqrt(3072)=55.4256 to become 55.5
+ # See https://github.com/huggingface/transformers/pull/29402
+ normalizer = torch.tensor(self.config.hidden_size**0.5, dtype=hidden_states.dtype)
Collaborator

Shouldn't make a difference, but it would be cleaner to cache this one as hidden_states_scale?
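
For example (hypothetical attribute name taken from the comment, just to illustrate the caching idea):

# In GemmaModel.__init__ (sketch):
self.hidden_states_scale = self.config.hidden_size**0.5

# In forward, keeping the dtype-matching behaviour from the diff above:
normalizer = torch.tensor(self.hidden_states_scale, dtype=hidden_states.dtype)
hidden_states = hidden_states * normalizer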

Contributor Author

Done!

@danielhanchen (Contributor Author)

Closing this since we merged it! Thanks everyone!

Narsil pushed a commit to huggingface/text-generation-inference that referenced this pull request on Mar 21, 2024:

This PR adds `force_downcast_after` to `FastRMSNorm.forward` which is
used in the Gemma model. References
huggingface/transformers#29402 and
huggingface/transformers#29729

Setting `force_downcast_after=True` will perform the `hidden_states * weight` multiplication in f32 and then downcast to half precision. This differs slightly from the current implementation, which first casts the `hidden_states` to half precision and then multiplies.
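
Roughly, the flag changes the ordering like this (an illustration of the described behaviour, not the actual FastRMSNorm implementation in text-generation-inference):

import torch

def rms_norm(hidden_states, weight, eps=1e-6, force_downcast_after=False):
    # Illustration of the ordering difference described above.
    input_dtype = hidden_states.dtype
    hs = hidden_states.float()
    hs = hs * torch.rsqrt(hs.pow(2).mean(-1, keepdim=True) + eps)
    if force_downcast_after:
        # Gemma path: multiply by the weight in float32, downcast afterwards.
        return (hs * weight.float()).to(input_dtype)
    # Previous behaviour: downcast first, then multiply in half precision.
    return hs.to(input_dtype) * weight
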
@nxphi47 (Contributor) commented on Mar 22, 2024

@danielhanchen Many thanks for the fixes. Do you observe any performance difference before and after this fix? Thanks.

guocuimi added a commit to vectorch-ai/ScaleLLM that referenced this pull request on Apr 15, 2024:

Fix precision issue mentioned in huggingface/transformers#29402. This diff:
* fixed 1) Approx GELU and 3) sqrt(hidden_dim) with dtype
* fixed the head_dim for gemma_7b_* models
@cuichenx mentioned this pull request on Apr 17, 2024
cr313 (cr313/text-generation-inference-load-test, Apr 19, 2024) and kdamaszk (kdamaszk/tgi-gaudi, Apr 29, 2024) later referenced this pull request in commits carrying the same `force_downcast_after` change described above.