HF's llama implementation is different from meta-llama's #30872
Comments
The llama2 code is subject to a code license. No one is allowed to reuse the original code, which is why we have a different version.
See this: https://github.com/meta-llama/llama/blob/main/LICENSE
@ArthurZucker I can expand on this a bit, because there is an actual difference in the implementations. The difference is whether you treat sequential entries of the embeddings as (real, imaginary) pairs, or you treat the first half of the head dimension as real and the second half as imaginary. I think this has to mean that weights trained with one version must be incompatible with the other version? This has come up over in torchtitan (as linked in the OP of this issue), because I've been trying to load the HF llama weights and use them with the torchtitan codebase. However, these weights only work if I use the first-half/second-half implementation, such as the one used in HF transformers.
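For concreteness, here is a minimal sketch of the two conventions being described (toy shapes only; the function names are illustrative, not the actual library code):

```python
import torch

def apply_rope_interleaved(x, freqs_cis):
    # meta-llama convention: adjacent pairs (x[..., 0::2], x[..., 1::2]) are
    # treated as (real, imaginary) parts of complex numbers and rotated by
    # complex multiplication with freqs_cis.
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    return torch.view_as_real(x_complex * freqs_cis).flatten(-2).type_as(x)

def rotate_half(x):
    # HF transformers convention: the first half of the head dimension is the
    # "real" part and the second half is the "imaginary" part.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope_half_split(x, cos, sin):
    # cos/sin carry each frequency twice, once for each half of the head dim.
    return x * cos + rotate_half(x) * sin
```

Both rotate each two-dimensional frequency pair by the same angle; they only disagree about where the two components of a pair live inside the head dimension.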
Being able to load doesn’t mean the implementations are the same. The difference in the implementations will cause the output to be different, given the same input. I'm a bit confused here... Please correct me if I'm wrong. cc @rlrs
Okay, so this has been cleared up, I think. The HF weights are indeed different: see transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py, lines 152 to 154 at 481a957.
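For reference, the permutation applied by the conversion script is roughly the following (paraphrased sketch, not a verbatim copy of the linked lines):

```python
def permute(w, n_heads, dim1, dim2):
    # Reorder the rows of a q/k projection weight so that, within each head,
    # the interleaved [r0, i0, r1, i1, ...] layout becomes the half-split
    # [r0, r1, ..., i0, i1, ...] layout expected by the HF rotary code.
    return w.view(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2)
```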
Glad to find out that it is a simple weight permutation. This should probably be clarified somewhere.
@rlrs I probably should have specified that yes, this works because we permute the weights. But if you don't permute them and still use the cos/sin formulation, you still have a different implementation: you have to permute q and k on the fly, which costs a lot. Pretty sure it is clarified that you need a conversion script to go from the original checkpoints to the `transformers` format.

@tianyu-l I don't understand the confusion: if the implementation is different but the produced output, meaning the generations, is correct, where is the problem?
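The activation-level counterpart of that weight permutation would look roughly like this (illustrative sketch; reordering q and k like this on every forward pass is the extra cost being referred to):

```python
import torch

def interleave_to_half_split(x, n_heads, head_dim):
    # Reorder activations per head from [r0, i0, r1, i1, ...] to
    # [r0, r1, ..., i0, i1, ...]; doing this to q and k at every step is
    # what loading un-permuted weights into the HF rotary code would require.
    bsz, seq_len, _ = x.shape
    x = x.view(bsz, seq_len, n_heads, head_dim // 2, 2)
    return x.transpose(-1, -2).reshape(bsz, seq_len, n_heads * head_dim)
```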
this is just completely wrong, and unsourced. When we add a model to `transformers`, we make sure its outputs match the original implementation.
Thanks for the clarification! When I said different implementations, I meant different implementations that lead to different results on the same input. (I thought this was clear from the context, but it seems not; my bad.) I agree that as long as the outputs match, the implementation details don't matter that much.
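A quick self-contained numerical check of that point (toy sizes, single position and head; everything here is illustrative) shows the two formulations agree once the reordering is applied:

```python
import torch

torch.manual_seed(0)
head_dim, pos = 8, 3
inv_freq = 1.0 / (10000 ** (torch.arange(0, head_dim, 2).float() / head_dim))
theta = pos * inv_freq                          # rotation angles, one per pair

x = torch.randn(head_dim)                       # one head vector at one position

# meta-llama style: interleaved pairs rotated as complex numbers
freqs_cis = torch.polar(torch.ones_like(theta), theta)
out_meta = torch.view_as_real(torch.view_as_complex(x.reshape(-1, 2)) * freqs_cis).flatten()

# HF style: half-split layout, cos/sin repeated over both halves
perm = torch.cat([torch.arange(0, head_dim, 2), torch.arange(1, head_dim, 2)])
cos, sin = torch.cat([theta.cos()] * 2), torch.cat([theta.sin()] * 2)
y = x[perm]                                     # the weight permutation, applied here to activations
y1, y2 = y.chunk(2)
out_hf = y * cos + torch.cat([-y2, y1]) * sin

# identical results, up to the same reordering
assert torch.allclose(out_meta[perm], out_hf, atol=1e-6)
```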
cool, then we are on the same page! 🤗
System Info
E.g. the rotary embedding implementations are quite different. I'm wondering whether the two produce the same results, and if not, why they differ.
Here are the code pointers:
HF: transformers/src/transformers/models/llama/modeling_llama.py, line 184 at 15c74a2
meta-llama: https://github.com/meta-llama/llama3/blob/14aab0428d3ec3a9596f1dea06d9c564f9c0e35f/llama/model.py#L65
Who can help?
@ArthurZucker @gante
Reproduction
https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py
Expected behavior
Consistent implementation of llama models