
RoPE loses precision for Llama / Gemma + Gemma logits.float() #29285

Merged: 20 commits merged into huggingface:main on Feb 28, 2024

Conversation

@danielhanchen (Contributor)

Tagging @ArthurZucker :)

When I was implementing Gemma for Unsloth, I noticed that when one uses bfloat16, the RoPE embeddings get autocast to bfloat16, even though we require them to be in float32. This causes the positional encodings to lose precision dramatically, especially at very large context lengths.

Below is an image of how HF currently handles RoPE. You can see the loss in precision when using bfloat16. In Unsloth I manually upcast it to float32, and you can see the expected positional encodings.
[image: RoPE embeddings under bfloat16 (current HF path) vs. float32 (Unsloth upcast)]

I couldn't work out why Unsloth's loss did not match that of HF's original Gemma implementation. In float16 this issue does not occur, and HF's and Unsloth's training loss curves are equivalent:
[image: float16 training loss curves, HF vs. Unsloth (equivalent)]

However, when I switched over to bfloat16, HF's and Unsloth's training losses diverge from the start, and Unsloth retains a lower loss as training goes on:
[image: bfloat16 training loss curves, HF vs. Unsloth (diverging)]

If you look at the losses more carefully (same seed), the differences are clearer:
[image: zoomed-in loss comparison, same seed]

The culprit I found was:

inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
position_ids_expanded = position_ids[:, None, :].float()
freqs = (inv_freq_expanded @ position_ids_expanded).transpose(1, 2)
emb = torch.cat((freqs, freqs), dim=-1)

where, if one uses torch.autocast(), freqs = (inv_freq_expanded @ position_ids_expanded).transpose(1, 2) gets computed in bfloat16 rather than float32. I propose we turn off autocast to force float32, i.e.:

with torch.autocast(device_type=position_ids_expanded.device.type, enabled=False):
    freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)

This stops torch.autocast from automatically downcasting the RoPE embeddings to float16 / bfloat16. My proposed fix gives the following loss curve:
[image: training loss curve with the proposed fix]
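For anyone who wants to reproduce the autocast behaviour described above, here is a minimal sketch (not from the PR; the shapes and the CPU device are illustrative assumptions). Under torch.autocast, a matmul between two float32 tensors runs in the autocast dtype, which is exactly what downcasts freqs:

import torch

# Illustrative shapes: 64 inverse frequencies (head_dim 128), 8192 positions.
inv_freq_expanded = torch.rand(1, 64, 1, dtype=torch.float32)
position_ids_expanded = torch.arange(8192, dtype=torch.float32)[None, None, :]

# Outside autocast the matmul stays in float32.
freqs = inv_freq_expanded @ position_ids_expanded
print(freqs.dtype)  # torch.float32

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    # Under autocast, matmuls run in the low-precision dtype even for float32 inputs.
    freqs = inv_freq_expanded @ position_ids_expanded
    print(freqs.dtype)  # torch.bfloat16

    # Locally disabling autocast, as in the proposed fix, restores float32.
    with torch.autocast(device_type="cpu", enabled=False):
        freqs = inv_freq_expanded.float() @ position_ids_expanded.float()
        print(freqs.dtype)  # torch.float32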

Also, in Gemma, a one-liner was missed :) logits = logits.float() must be added to upcast the logits to float32. Although this should be done automatically under torch.autocast, it's best to keep the convention used in Llama, Mistral, and other models. Gemma's implementation seems to have forgotten this one line :)
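To illustrate where the one-liner sits relative to the loss, here is a toy sketch (this is not the Gemma code; the tiny lm_head, shapes and CPU autocast are made-up stand-ins):

import torch
import torch.nn as nn

vocab_size, hidden_size = 32, 16
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)  # toy stand-in for the LM head
hidden_states = torch.randn(2, 8, hidden_size)
labels = torch.randint(0, vocab_size, (2, 8))

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    logits = lm_head(hidden_states)  # bfloat16 under autocast
    logits = logits.float()          # the missing one-liner: upcast before the loss / softmax
    loss = nn.functional.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))

print(logits.dtype)  # torch.float32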

Llama - Force float32 since bfloat16 loses precision on long contexts
Fix RoPE and logits.float()
@danielhanchen (Contributor, Author)

Forgot to add: I'm not certain whether this will break CUDA Graphs for faster inference - hopefully not.

@ArthurZucker (Collaborator) left a comment

I'll have to check the compile test and everything, but we usually hate this kind of change 🫣 The bug is real; I'll see if I can find a good alternative, as this is pretty much only for training! Great catch 🤗

Review comment on src/transformers/models/llama/modeling_llama.py (outdated, resolved)
@danielhanchen (Contributor, Author)

Sadly I'm unsure if it's just for training :(( For inference I don't remember up to which context length bfloat16 won't be an issue; I think it was up to 4096. However, bfloat16 loses precision even for inference after 4096 tokens of context, and definitely by 8192: bfloat16 essentially thinks the last 4 tokens are all position 8192, i.e. [8192, 8192, 8192, 8192], whereas the correct float32 values are [8188, 8189, 8190, 8191].
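That last point is easy to reproduce in isolation (just the dtype behaviour, nothing model-specific):

import torch

positions = torch.arange(8188, 8192, dtype=torch.float32)
print(positions)                     # tensor([8188., 8189., 8190., 8191.])
print(positions.to(torch.bfloat16))  # tensor([8192., 8192., 8192., 8192.], dtype=torch.bfloat16)
# bfloat16 has only an 8-bit significand, so representable integers between 4096 and 8192
# are spaced 32 apart and neighbouring positions collapse onto the same value.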

@ArthurZucker (Collaborator) left a comment

LGTM, let's do no_grad and autocast; I'll test compile once you have both!
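For reference, a minimal sketch of how the two could be combined, i.e. a rotary-embedding forward under @torch.no_grad() with autocast disabled locally. This is an illustrative module (the name and details are assumptions), not necessarily the exact code that ended up being merged:

import torch
import torch.nn as nn

class RotaryEmbeddingSketch(nn.Module):
    def __init__(self, dim, base=10000.0):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.int64).float() / dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

    @torch.no_grad()
    def forward(self, x, position_ids):
        # x: [bs, num_attention_heads, seq_len, head_size]
        inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
        position_ids_expanded = position_ids[:, None, :].float()
        # Fall back to "cpu" for device types (like "mps") that autocast may not support.
        device_type = x.device.type if x.device.type != "mps" else "cpu"
        with torch.autocast(device_type=device_type, enabled=False):
            # Keep the outer product, cos and sin in float32 regardless of any enclosing autocast.
            freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
            emb = torch.cat((freqs, freqs), dim=-1)
            cos, sin = emb.cos(), emb.sin()
        return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)

The explicit .float() calls inside the disabled-autocast region are belt and braces; with autocast disabled, float32 inputs already stay in float32.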

@ArthurZucker (Collaborator) left a comment

LGTM, before merging I'll ping @pacman100, @younesbelkada and @fxmarty as this is pretty important! Feel free to comment if you are against these changes!

def forward(self, x, position_ids, seq_len=None):
    # x: [bs, num_attention_heads, seq_len, head_size]
    if self.inv_freq is None:
        self.inv_freq = 1.0 / (
            self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64, device=x.device).float() / self.dim)
        )

A collaborator left a suggested change on this snippet.

@gante (Member) commented Feb 27, 2024

@danielhanchen .sin() and .cos() should ideally happen in FP32 as well. Have you noticed any performance changes if you force them to happen in FP32?

@gante mentioned this pull request on Feb 27, 2024
@danielhanchen (Contributor, Author)

@gante Actually, interesting point. I can see torch.autocast does asin and sinh etc. in float32, but it doesn't list sin itself; I'll have to check whether .sin() is done in float32 or float16.
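A quick way to check the dtype behaviour (CPU autocast here for illustration; as far as I can tell, sin/cos are not on the CUDA autocast lists either):

import torch

x = torch.linspace(0, 100, steps=5, dtype=torch.float32)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    # sin/cos are not on autocast's cast lists, so they run in the dtype of their input.
    print(x.sin().dtype)                     # torch.float32
    print(x.to(torch.bfloat16).sin().dtype)  # torch.bfloat16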

@danielhanchen (Contributor, Author)

@ArthurZucker I checked everything and it's working! You guys can double check if anything is wrong. You can push the commit whenever. Thank you! :)

@ArthurZucker (Collaborator) left a comment

LGTM! Thanks a mile for this.
Let's make sure you run make style and make fixup for the last CIs

Review comments on src/transformers/models/gemma/modeling_gemma.py and src/transformers/models/llama/modeling_llama.py (outdated, resolved)
danielhanchen and others added 6 commits February 28, 2024 21:30
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
This reverts commit b860a22.
@ArthurZucker (Collaborator)

        self.register_buffer("_cos_cached", emb.cos().to(torch.get_default_dtype()), persistent=False)
        self.register_buffer("_sin_cached", emb.sin().to(torch.get_default_dtype()), persistent=False)

I doubt that this is the cause of the issue, since we used to do that before already.

@ArthurZucker (Collaborator)

Thanks @danielhanchen 🤗 merging now!

@ArthurZucker ArthurZucker merged commit d3a4b47 into huggingface:main Feb 28, 2024
18 checks passed
ArthurZucker added a commit that referenced this pull request Feb 28, 2024
* Update modeling_llama.py

Llama - Force float32 since bfloat16 loses precision on long contexts

* Update modeling_llama.py

* Update modeling_gemma.py

Fix RoPE and logits.float()

* @torch.no_grad()

* @torch.no_grad()

* Cos, Sin to float32

* cos, sin to float32

* Update src/transformers/models/gemma/modeling_gemma.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/models/llama/modeling_llama.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Resolve PR conflicts

* Fix RoPE for llama

* Revert "Fix RoPE for llama"

This reverts commit b860a22.

* Fix RoPE for llama

* RoPE device

* Autocast device type

* RoPE

* RoPE isinstance

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
@danielhanchen (Contributor, Author)

Yay! :) Great work everyone :)

@suryabhupa

(hello from the Gemma team!) Superb stuff, thanks for this lovely fix :)

ArthurZucker added a commit that referenced this pull request Mar 1, 2024
@danielhanchen (Contributor, Author)

@suryabhupa Thanks :)
Actually, I was going to ask: do you know whether Gemma uses approximate GELU or exact GELU? When comparing Keras to HF, torch.dist gives 4.7943, while with the tanh approximation it gives 0.0057:

[image: GELU comparison between Keras and HF]

@paulcx commented Mar 6, 2024

Did anyone realize that this fix incurs an additional VRAM increase, likely causing OOM for code that could previously be trained?

@danielhanchen (Contributor, Author)

@paulcx Yes, unfortunately that was a trade-off we accepted; the issue is that results become incorrect, especially on longer context lengths (shorter is fine). We tried our best to isolate the changes: for inference this is fine, but for training there will be more VRAM usage. There's always Unsloth for Gemma of course, where we allow 2.5x faster finetuning and 60% VRAM reductions :) And TRL, PEFT, and the other HF libraries are all integrated. Gemma 7b notebook: https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing

@ArthurZucker (Collaborator)

And though this increases VRAM usage, the precision and quality of the outputs should be improved. This is also more "accurate" with respect to the original implementations.

@paulcx commented Mar 6, 2024

@danielhanchen Thank you for the finding. It would be great if this could be fixed on top of the existing implementation without increasing VRAM, at the very least. Some resource-constrained training runs may no longer be able to train because of this extra VRAM (even with the batch size set to 1).

@ArthurZucker (Collaborator)

How much of a VRAM increase are we talking about?

@paulcx commented Mar 6, 2024

It's difficult to give an exact number, but the growth in VRAM requirements likely depends on the size of the model; for instance, for a 34B model it could be in the tens of GBs?

@danielhanchen (Contributor, Author) commented Mar 6, 2024

@paulcx Wait, 10s of GBs? The PR is only for Llama and Gemma, so I'm assuming CodeLlama 34b is using more VRAM? The RoPE upcasting should only use approximately 8192 * 8192 * 16 bits ≈ 128MB of extra VRAM per layer (times n layers, say 32), and it should be cleared away since we don't need the 8192x8192 matrix afterwards. I don't see where an extra 10GB of VRAM would come from. Are you finetuning Gemma or Llama?

I can see why Gemma might use more VRAM, since logits.float() is necessary; otherwise the softmax runs in torch.float16, which causes incorrect results when training over long periods of time.

For Llama, at most around 128MB of extra VRAM should be used.
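As a quick sanity check of the 128MB figure quoted above (just arithmetic, nothing PR-specific):

# An 8192 x 8192 matrix in a 16-bit dtype (bfloat16 / float16):
bytes_per_element = 2
print(8192 * 8192 * bytes_per_element / 1024**2)  # 128.0 (MiB)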

@paulcx commented Mar 6, 2024

@danielhanchen I must say my estimate is likely quite imprecise, given that I haven't run careful experiments or collected statistics. The only evidence I have for GBs is my own experience with a 34B CodeLlama (which trained successfully on transformers==4.37.2). I know it could be caused by many things, for example multi-GPU training with deepspeed, etc.

[screenshot: CUDA out-of-memory error during 34B CodeLlama training]

@danielhanchen (Contributor, Author) commented Mar 6, 2024

@paulcx OHH, this is a different issue!! It seems a change was made in another PR which allocates a causal mask of size (16384, 16384): https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L940

The triu causes the causal mask to be upcast to float32, using 16384^2 * 4 bytes = 1GB of extra VRAM. Your screenshot shows roughly n^2 * 4 bytes / 1024^3 ≈ 37.25GB, so I'm assuming you're also doing RoPE scaling to a 100K context length, i.e. a (100K, 100K) matrix was being created.
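The figures quoted here check out (again just arithmetic, not code from the PR):

GiB = 1024**3
print(16384 ** 2 * 4 / GiB)    # 1.0    -> float32 (16384, 16384) causal mask
print(100_000 ** 2 * 4 / GiB)  # ~37.25 -> float32 (100K, 100K) mask, matching the screenshot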

So this looks like a separate problem from this PR, unfortunately.

@paulcx commented Mar 6, 2024

Thanks @danielhanchen

@gante (Member) commented Mar 6, 2024

@paulcx your issue is related to this one (#29484) -- let's keep the discussion there! :)

itazap pushed a commit that referenced this pull request May 14, 2024