
Possible Bug with KV Caching in Llama (original) model #25420

Closed · maximkha opened this issue Aug 9, 2023 · 22 comments

@maximkha

maximkha commented Aug 9, 2023

System Info

transformers==4.31.0

  • huggingface_hub version: 0.15.1
  • Platform: Linux-5.15.0-78-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Running in iPython ?: No
  • Running in notebook ?: No
  • Running in Google Colab ?: No
  • Token path ?: /u/k/h/khanov/.cache/huggingface/token
  • Has saved token ?: False
  • Configured git credential helpers:
  • FastAI: N/A
  • Tensorflow: N/A
  • Torch: 2.0.0
  • Jinja2: 3.0.3
  • Graphviz: N/A
  • Pydot: N/A
  • Pillow: 9.0.1
  • hf_transfer: N/A
  • gradio: N/A
  • numpy: 1.24.2
  • ENDPOINT: https://huggingface.co
  • HUGGINGFACE_HUB_CACHE: /u/k/h/khanov/.cache/huggingface/hub
  • HUGGINGFACE_ASSETS_CACHE: /u/k/h/khanov/.cache/huggingface/assets
  • HF_TOKEN_PATH: /u/k/h/khanov/.cache/huggingface/token
  • HF_HUB_OFFLINE: False
  • HF_HUB_DISABLE_TELEMETRY: False
  • HF_HUB_DISABLE_PROGRESS_BARS: None
  • HF_HUB_DISABLE_SYMLINKS_WARNING: False
  • HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
  • HF_HUB_DISABLE_IMPLICIT_TOKEN: False
  • HF_HUB_ENABLE_HF_TRANSFER: False

Who can help?

@ArthurZucker, @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I was working on a custom decoding method; however, I found a deviation from greedy search when using KV caching.

import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
from tqdm import tqdm

MODEL_PATH = "/nobackup-fast/khanov/llama-7b" # "huggyllama/llama-7b"
GEN_DEV = "cuda:0"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.bfloat16).to(GEN_DEV)

def get_input_ids(prompt: str) -> torch.Tensor:
    global model, tokenizer
    tokens = tokenizer(prompt, return_tensors="pt").input_ids.to(GEN_DEV)
    return tokens
def tokens_to_text(tokens: torch.Tensor):
    return tokenizer.batch_decode(tokens, skip_special_tokens=True)


PROMPT = "This is a " # this is just a test prompt

# greedy decoding without caching 
tokens = get_input_ids(PROMPT)
for _ in tqdm(range(40)):
    with torch.no_grad():
        mout = model(tokens)
    tokens = torch.hstack((tokens, torch.argmax(mout.logits[0, -1]).unsqueeze(0).unsqueeze(0)))
    
without_cache = tokens_to_text(tokens)[0]
print(f"{without_cache=}")

# greedy decoding WITH caching 
tokens = get_input_ids(PROMPT)
cached = None
for _ in tqdm(range(40)):
    with torch.no_grad():
        if cached is None:
            mout = model(tokens, output_hidden_states=True, use_cache=True)
            cached = mout.past_key_values
        else:
            mout = model(tokens, past_key_values=cached, use_cache=True, output_hidden_states=True)
            cached = mout.past_key_values
    tokens = torch.hstack((tokens, torch.argmax(mout.logits[0, -1]).unsqueeze(0).unsqueeze(0)))
    
with_cache = tokens_to_text(tokens)[0]
print(f"{with_cache=}")

# normal greedy search with HF Generate implementation
tokens = get_input_ids(PROMPT)
tokens = model.generate(tokens, num_return_sequences=1, max_new_tokens=40)
generate_output = tokens_to_text(tokens)[0]
print(f"{generate_output=}")

# this matches exactly
assert without_cache == generate_output

# this does not!
assert without_cache == with_cache

Expected behavior

I was expecting the results not to change when using the past_key_values kwarg; however, when passing past_key_values, the model assigned different logits to the tokens. This deviates from the model.generate behavior too. This is possibly related to #18809 and #21080.

@sgugger
Collaborator

sgugger commented Aug 9, 2023

cc @ArthurZucker and @gante

@ArthurZucker
Collaborator

Hey! It seems like the problem is in your custom code rather than the Llama past-key-values mechanism, as generate() uses past key values by default unless your generation config has generation_config.use_cache = False.

I don't know exactly what is wrong with your custom greedy decoding, but my guess is that you are not feeding the position ID information that is automatically created in prepare_inputs_for_generation during generation.

@gante
Member

gante commented Aug 10, 2023

Hi @maximkha 👋

Thank you for raising this issue! Sadly, our bandwidth is limited, so we can't dive deeply into custom code for which a solution already exists :)

As @ArthurZucker wrote, you are missing the position IDs, which may have a significant impact on the output. The same is true for the attention mask. Our modeling code makes its best effort to infer these two inputs when they are missing, but it fails in some cases.

My suggestion would be to introduce a breakpoint() in generate, before the model forward pass, and compare the inputs that go into the model :)

@maximkha
Author

Thanks so much! It turns out the prepare_inputs_for_generation function prepares the position ID information as you said, and after adding that in, the results match exactly! I'll go ahead and close this!
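
For later readers, here is a minimal sketch of what the fixed loop can look like, reusing the model's own prepare_inputs_for_generation so that position_ids and the trimmed input_ids are built the same way generate() builds them (names follow the script above; an illustrative reconstruction, not the exact code from this thread):

# greedy decoding WITH caching, letting prepare_inputs_for_generation do the bookkeeping
tokens = get_input_ids(PROMPT)
attention_mask = torch.ones_like(tokens)
cached = None
for _ in range(40):
    with torch.no_grad():
        model_inputs = model.prepare_inputs_for_generation(
            tokens, past_key_values=cached, attention_mask=attention_mask, use_cache=True
        )
        mout = model(**model_inputs)
        cached = mout.past_key_values
    next_token = torch.argmax(mout.logits[0, -1]).reshape(1, 1)
    tokens = torch.hstack((tokens, next_token))
    attention_mask = torch.ones_like(tokens)

print(tokens_to_text(tokens)[0])  # matches the no-cache and generate() outputs (up to the 16-bit effects discussed later in this thread)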

@maximkha
Author

Actually, I'm currently experiencing another issue when using this for Llama for sequence classification. It seems that even when I use prepare_inputs_for_generation, I'm getting values that disagree. I'm not exactly sure what the culprit is, but I have been using the appropriate _reorder_cache function.

@maximkha maximkha reopened this Aug 10, 2023
@ArthurZucker
Collaborator

Are you using padding? If so, which padding side are you using? We had a few bug fixes related to padding recently (see #24979); it should work on main with left padding.

@maximkha
Author

Hey @ArthurZucker, thanks for the response. I'm actually not using any padding. Here's a minimal reproducible example:

from transformers import LlamaForSequenceClassification
import torch

# simple attention mask code
def create_attention_mask(seq_len, bsz=1):
    return torch.ones((bsz, seq_len))

# from https://github.com/huggingface/transformers/blob/5e5fa0d88c293e6d5be2517b4f45680ba3bb5df2/src/transformers/models/llama/modeling_llama.py#L856
def prepare_inputs_for_generation(input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs):
        if past_key_values:
            input_ids = input_ids[:, -1:]

        position_ids = kwargs.get("position_ids", None)
        if attention_mask is not None and position_ids is None:
            # create position_ids on the fly for batch generation
            position_ids = attention_mask.long().cumsum(-1) - 1
            position_ids.masked_fill_(attention_mask == 0, 1)
            if past_key_values:
                position_ids = position_ids[:, -1].unsqueeze(-1)

        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
        if inputs_embeds is not None and past_key_values is None:
            model_inputs = {"inputs_embeds": inputs_embeds}
        else:
            model_inputs = {"input_ids": input_ids}

        model_inputs.update(
            {
                "position_ids": position_ids,
                "past_key_values": past_key_values,
                "use_cache": kwargs.get("use_cache"),
                "attention_mask": attention_mask,
            }
        )
        return model_inputs

# this is huggyllama/llama-7b
MODEL = "/nobackup-fast/khanov/llama-7b"
classification_model = LlamaForSequenceClassification.from_pretrained(MODEL, num_labels=1, torch_dtype=torch.bfloat16).cuda()

# for simplicity (and to clearly illustrate the effect), set all the weights to 1
with torch.no_grad():
    classification_model.score.weight.set_(torch.ones_like(classification_model.score.weight))

# some random tokens
test_tokens = torch.tensor([1,263,29901,2599])
test_tokens = test_tokens.unsqueeze(0).cuda()
# some additional test token that we would like to run our classification model on
new_test_tokens = torch.hstack((test_tokens, torch.tensor([5]).unsqueeze(0).cuda()))

# generate the cache
cls_out = classification_model(**prepare_inputs_for_generation(test_tokens, past_key_values=None, attention_mask=create_attention_mask(test_tokens.shape[-1], test_tokens.shape[0]), use_cache=True))

# run the classification model without any special caching stuff
print("Correct output (with prepare_inputs)")
cls_out_new = classification_model(**prepare_inputs_for_generation(new_test_tokens, past_key_values=None, attention_mask=create_attention_mask(new_test_tokens.shape[-1], new_test_tokens.shape[0])))
print(f"{cls_out_new.logits=}")
# cls_out_new.logits = 89

# run it without the prepare input (just in case that's the issue)
print("Correct output (no prepare_inputs)")
cls_out_new = classification_model(new_test_tokens)
print(f"{cls_out_new.logits=}")
# cls_out_new.logits = 89

# with caching, and prepare input
print("With past_key_values (with prepare_inputs)")
cls_out_test = classification_model(**prepare_inputs_for_generation(new_test_tokens, past_key_values=cls_out.past_key_values, attention_mask=create_attention_mask(new_test_tokens.shape[-1], new_test_tokens.shape[0]), use_cache=True))

print(f"{cls_out_test.logits=}")
# cls_out_test.logits = 88.5

# with caching, without prepare input
print("With past_key_values (no prepare_inputs)")
cls_out_test = classification_model(new_test_tokens[:, -1:], past_key_values=cls_out.past_key_values, attention_mask=create_attention_mask(new_test_tokens.shape[-1], new_test_tokens.shape[0]), position_ids=torch.tensor([[new_test_tokens.shape[-1] -1]]), use_cache=True)

print(f"{cls_out_test.logits=}")
# cls_out_test.logits = 88.5

The prepare_inputs_for_generation was taken from here.

Please let me know if anything seems wrong about this! I really appreciate the help!

@maximkha
Author

Hmmmm, this is also happening if I replace LlamaForSequenceClassification with LlamaForCausalLM.

There are slight discrepancies in the logits:

Example
from transformers import LlamaForSequenceClassification, LlamaForCausalLM
import torch

# this is huggyllama/llama-7b
MODEL = "/nobackup-fast/khanov/llama-7b"
llm = LlamaForCausalLM.from_pretrained(MODEL, num_labels=1, torch_dtype=torch.bfloat16).cuda()

# simple attention mask code
def create_attention_mask(seq_len, bsz=1):
    return torch.ones((bsz, seq_len))

# from https://github.com/huggingface/transformers/blob/5e5fa0d88c293e6d5be2517b4f45680ba3bb5df2/src/transformers/models/llama/modeling_llama.py#L856
def prepare_inputs_for_generation(input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs):
        if past_key_values:
            input_ids = input_ids[:, -1:]

        position_ids = kwargs.get("position_ids", None)
        if attention_mask is not None and position_ids is None:
            # create position_ids on the fly for batch generation
            position_ids = attention_mask.long().cumsum(-1) - 1
            position_ids.masked_fill_(attention_mask == 0, 1)
            if past_key_values:
                position_ids = position_ids[:, -1].unsqueeze(-1)

        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
        if inputs_embeds is not None and past_key_values is None:
            model_inputs = {"inputs_embeds": inputs_embeds}
        else:
            model_inputs = {"input_ids": input_ids}

        model_inputs.update(
            {
                "position_ids": position_ids,
                "past_key_values": past_key_values,
                "use_cache": kwargs.get("use_cache"),
                "attention_mask": attention_mask,
            }
        )
        return model_inputs
    
# for simplicity (and to clearly illustrate the effect), set all the weights to 1
# with torch.no_grad():
#     classification_model.score.weight.set_(torch.ones_like(classification_model.score.weight))

# some random tokens
test_tokens = torch.tensor([1,263,29901,2599])
test_tokens = test_tokens.unsqueeze(0).cuda()
# some additional test token that we would like to run our classification model on
new_test_tokens = torch.hstack((test_tokens, torch.tensor([5]).unsqueeze(0).cuda()))

# generate the cache
llm_out = llm(**prepare_inputs_for_generation(test_tokens, past_key_values=None, attention_mask=create_attention_mask(test_tokens.shape[-1], test_tokens.shape[0]), use_cache=True))

# run the classification model without any special caching stuff
print("Correct output (with prepare_inputs)")
llm_out_new = llm(**prepare_inputs_for_generation(new_test_tokens, past_key_values=None, attention_mask=create_attention_mask(new_test_tokens.shape[-1], new_test_tokens.shape[0])))
print(f"{llm_out_new.logits[0, -1, :]=}")
"""Correct output (with prepare_inputs)
llm_out_new.logits[0, -1, :]=tensor([-12.0625, -15.3125,   2.5781,  ...,  -6.4688,  -8.1250,  -6.8125],
       device='cuda:0', grad_fn=<SliceBackward0>)"""

# run it without the prepare input (just in case that's the issue)
print("Correct output (no prepare_inputs)")
llm_out_new = llm(new_test_tokens)
print(f"{llm_out_new.logits[0, -1, :]=}")
"""Correct output (no prepare_inputs)
llm_out_new.logits[0, -1, :]=tensor([-12.0625, -15.3125,   2.5781,  ...,  -6.4688,  -8.1250,  -6.8125],
       device='cuda:0', grad_fn=<SliceBackward0>)"""

# with caching, and prepare input
print("With past_key_values (with prepare_inputs)")
llm_out_test = llm(**prepare_inputs_for_generation(new_test_tokens, past_key_values=llm_out.past_key_values, attention_mask=create_attention_mask(new_test_tokens.shape[-1], new_test_tokens.shape[0]), use_cache=True))

print(f"{llm_out_test.logits[0, -1, :]=}")
"""With past_key_values (with prepare_inputs)
llm_out_test.logits[0, -1, :]=tensor([-12.0625, -15.3750,   2.5938,  ...,  -6.5000,  -8.1250,  -6.8125],
       device='cuda:0', grad_fn=<SliceBackward0>)"""

# with caching, without prepare input
print("With past_key_values (no prepare_inputs)")
llm_out_test = llm(new_test_tokens[:, -1:], past_key_values=llm_out.past_key_values, attention_mask=create_attention_mask(new_test_tokens.shape[-1], new_test_tokens.shape[0]), position_ids=torch.tensor([[new_test_tokens.shape[-1] -1]]), use_cache=True)

print(f"{llm_out_test.logits[0, -1, :]=}")

"""With past_key_values (no prepare_inputs)
llm_out_test.logits[0, -1, :]=tensor([-12.0625, -15.3750,   2.5938,  ...,  -6.5000,  -8.1250,  -6.8125],
       device='cuda:0', grad_fn=<SliceBackward0>)"""

@maximkha
Author

maximkha commented Aug 11, 2023

OK, I think I found the culprit! It seems that when using past_key_values with bfloat16, the errors are huge.

float32 (default):
max abs diff between logits (with vs without past_key_values) = 1.0490e-05

With bfloat16:
max abs diff between logits (with vs without past_key_values) = 0.1250

With float16:
max abs diff between logits (with vs without past_key_values) = 0.0195

Since the unit tests only check FP32, they aren't catching this.

Here's the script to measure this:

Script
from transformers import LlamaForSequenceClassification, LlamaForCausalLM
import torch

# this is huggyllama/llama-7b
MODEL = "/nobackup-fast/khanov/llama-7b"
WITH_BFLOAT16 = False

if WITH_BFLOAT16:
    llm = LlamaForCausalLM.from_pretrained(MODEL, num_labels=1, torch_dtype=torch.bfloat16).cuda()
else:
    llm = LlamaForCausalLM.from_pretrained(MODEL, num_labels=1).cuda()

# simple attention mask code
def create_attention_mask(seq_len, bsz=1):
    return torch.ones((bsz, seq_len))

# from https://github.com/huggingface/transformers/blob/5e5fa0d88c293e6d5be2517b4f45680ba3bb5df2/src/transformers/models/llama/modeling_llama.py#L856
def prepare_inputs_for_generation(input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs):
        if past_key_values:
            input_ids = input_ids[:, -1:]

        position_ids = kwargs.get("position_ids", None)
        if attention_mask is not None and position_ids is None:
            # create position_ids on the fly for batch generation
            position_ids = attention_mask.long().cumsum(-1) - 1
            position_ids.masked_fill_(attention_mask == 0, 1)
            if past_key_values:
                position_ids = position_ids[:, -1].unsqueeze(-1)

        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
        if inputs_embeds is not None and past_key_values is None:
            model_inputs = {"inputs_embeds": inputs_embeds}
        else:
            model_inputs = {"input_ids": input_ids}

        model_inputs.update(
            {
                "position_ids": position_ids,
                "past_key_values": past_key_values,
                "use_cache": kwargs.get("use_cache"),
                "attention_mask": attention_mask,
            }
        )
        return model_inputs

# some random tokens
test_tokens = torch.tensor([1,263,29901,2599])
test_tokens = test_tokens.unsqueeze(0).cuda()
# some additional test token that we would like to run our classification model on
new_test_tokens = torch.hstack((test_tokens, torch.tensor([5]).unsqueeze(0).cuda()))

# generate the cache
llm_out = llm(**prepare_inputs_for_generation(test_tokens, past_key_values=None, attention_mask=create_attention_mask(test_tokens.shape[-1], test_tokens.shape[0]), use_cache=True))

# run the classification model without any special caching stuff
print("Correct output (with prepare_inputs)")
llm_out_new = llm(**prepare_inputs_for_generation(new_test_tokens, past_key_values=None, attention_mask=create_attention_mask(new_test_tokens.shape[-1], new_test_tokens.shape[0])))
print(f"{llm_out_new.logits[0, -1, :]=}")

# run it without the prepare input (just in case that's the issue)
print("Correct output (no prepare_inputs)")
llm_out_new = llm(new_test_tokens)
print(f"{llm_out_new.logits[0, -1, :]=}")

# with caching, and prepare input
print("With past_key_values (with prepare_inputs)")
llm_out_test = llm(**prepare_inputs_for_generation(new_test_tokens, past_key_values=llm_out.past_key_values, attention_mask=create_attention_mask(new_test_tokens.shape[-1], new_test_tokens.shape[0]), use_cache=True))

print(f"{llm_out_test.logits[0, -1, :]=}")
print(f"{torch.max(torch.abs(llm_out_new.logits[0, -1, :]-llm_out_test.logits[0, -1, :]))=}")
# HERE: this is 1.0490e-05 when using f32, and 0.1250 when using bfloat16

# with caching, without prepare input
print("With past_key_values (no prepare_inputs)")
llm_out_test = llm(new_test_tokens[:, -1:], past_key_values=llm_out.past_key_values, attention_mask=create_attention_mask(new_test_tokens.shape[-1], new_test_tokens.shape[0]), position_ids=torch.tensor([[new_test_tokens.shape[-1] -1]]), use_cache=True)

print(f"{llm_out_test.logits[0, -1, :]=}")
print(f"{torch.max(torch.abs(llm_out_new.logits[0, -1, :]-llm_out_test.logits[0, -1, :]))=}")
# HERE: this is 1.0490e-05 when using f32, and 0.1250 when using bfloat16

Any ideas of how to fix this discrepancy?

@maximkha
Author

@ArthurZucker, any updates on this?

@ArthurZucker
Collaborator

Hey @maximkha, I don't have an update on this right now, no 😅 I'll let @gante have a look, as I won't have time to dive into this.

@maximkha
Author

I appreciate the update!

@gante
Member

gante commented Aug 16, 2023

Likewise, I won't have the bandwidth to help unless it is a bug reproducible with a short script based on a non-custom generate :)

@maximkha
Author

maximkha commented Aug 16, 2023

Hey @gante, this isn't an issue with generate specifically: it seems that when using KV caching with bfloat16, the logits are significantly different from the non-cached version (some precision loss, I'm assuming). There is no generation involved; just using past_key_values with bfloat16 skews the logits.

I'm not sure if this level of precision loss is to be expected or not.

TL;DR this is a problem with precision + caching, not generate.

Also, sorry for all the messages, but this level of precision loss is impacting my results.

@gante
Member

gante commented Oct 23, 2023

Hey folks 👋 I’ve done a deep dive on this issue, and I will link related issues to this comment, which attempts to summarize the findings.

cc:

  • @maximkha, who has rightly kept pushing us to figure out this mismatch;
  • @ArthurZucker, who has been seeing other issues like this

TL;DR

Using KV caches, assisted generation, and batching will change the logits. This happens in most, if not all, models at all precisions, but it is almost imperceptible in FP32. With 16 bits, the difference becomes non-negligible. The model was not trained with KV caches or left-padding, so this introduces a distribution shift -- it’s part of the cost of using a lower precision and other related optimizations. The effect is more visible when do_sample=True, as greedy decoding (do_sample=False) often selects the same token despite the differences.

Why does this happen?

A key operation in neural networks is matrix multiplication, where values are multiplied and accumulated. Unless you have infinite precision, different implementations or different shapes (e.g. cropping a few rows of the first matrix) may produce different outputs, as the intermediary calculations must remain in the specified precision and are subject to rounding. For instance, our models with TF and JAX implementations never produce exactly the same output as the PyTorch implementation; they tend to differ by at most ~1e-5 in FP32 for the exact same input, due to minor differences in the frameworks' inner implementations.

When using KV caches (and, in some models, left-padding), we are changing the input shape of some matrix multiplication operations. For instance, in Llama, when you apply the linear projection to obtain the QKV for the attention layer, the input shape will be different depending on whether you're using left-padding and/or KV caches. Therefore, the output of these operations may be different, and these tiny differences build up across layers and across generated tokens, especially at lower precisions.
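
To make the shape dependence concrete, here is a tiny standalone sketch (not taken from the modeling code) that projects the last token of a sequence once as part of the full-sequence matmul and once on its own, the way the cached decode step does. Depending on the backend and which kernel gets picked for each shape, the two results can already disagree at 16-bit precision:

import torch

torch.manual_seed(0)
dtype = torch.bfloat16
hidden = torch.randn(8, 4096, dtype=dtype)   # hidden states for an 8-token sequence
w_q = torch.randn(4096, 4096, dtype=dtype)   # stand-in for a q_proj weight matrix

full = hidden @ w_q            # no-cache path: project every position at once
last_only = hidden[-1:] @ w_q  # cached path: project only the newest token

# Mathematically the last row is identical in both cases, but rounding inside the
# matmul kernel can depend on the input shape; the printed gap may be zero on some
# backends and clearly nonzero on others.
print((full[-1:] - last_only).abs().max())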

If you place a breakpoint inside the model, and see what happens with and without KV caches, you'll see:

  1. During prefill (parsing the input prompt), the KV caches and the hidden states are exactly the same, as the inputs contain the same values and shapes.
  2. When generating one token at a time, you will see a divergence happening in the hidden states and the QKV after operations like linear layers.

How big is this difference?

Let's do a simple experiment: for the same set of inputs, let's measure the maximum difference in the hidden states and the logits for the first generated token, with and without KV caching. I created the following test script from an example given in a related issue (#26344). TL;DR it averages the maximum absolute difference for the variables described above over 1000 runs:

Test script
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from datasets import load_dataset
from tqdm import tqdm


TOTAL_NUM_SAMPLES = 1000
INPUT_LEN = 64

model_name = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, low_cpu_mem_usage=True, device_map="auto"
)

# model = AutoModelForCausalLM.from_pretrained(model_name)

ds = load_dataset("bigcode/the-stack", data_dir="data/python", split="train", streaming=True)
ds_iterator = iter(ds.take(TOTAL_NUM_SAMPLES))
max_diffs = {}
for _ in tqdm(range(TOTAL_NUM_SAMPLES)):
    next_data = next(ds_iterator)["content"]
    all_input_ids = tokenizer(
        [next_data], return_tensors="pt", max_length=INPUT_LEN, truncation=True
    ).input_ids.to(model.device)

    # process the whole sequence
    all_outputs = model(all_input_ids, output_hidden_states=True, return_dict=True)
    # get logits for the last token
    last_token_logits = all_outputs.logits[0][-1:]

    # process the sequence except the last token
    kv = model(all_input_ids[:, :-1]).past_key_values
    # input only the last token with previous kv_cache
    new_output = model(all_input_ids[:, -1:], past_key_values=kv, output_hidden_states=True, return_dict=True)
    # extract the last token logits
    new_last_token_logits = new_output.logits[0][-1:]

    for layer_idx in range(len(all_outputs.hidden_states)):
        max_diff = torch.abs(
            all_outputs.hidden_states[layer_idx][:, -1, :] - new_output.hidden_states[layer_idx]
        ).max()
        max_diffs.setdefault(f"layer {layer_idx}", []).append(max_diff.cpu().item())

    # these two distributions should be equal, but they are not.
    max_diffs.setdefault("logits", []).append(torch.abs(last_token_logits - new_last_token_logits).max().cpu().item())

for key, value in max_diffs.items():
    print(f"{key}: {sum(value) / len(value)}")

Here are the results I got for CodeLlama (which uses the same code as Llama and Llama2), with GPT2 in FP16 for comparison:

Llama, FP32
layer 0: 0.0
layer 1: 4.981691017746925e-07
layer 2: 2.5094859302043914e-06
layer 3: 2.6547210291028024e-06
layer 4: 2.8776237741112707e-06
layer 5: 3.2249726355075836e-06
layer 6: 3.5362401977181435e-06
layer 7: 3.871295601129532e-06
layer 8: 4.376612603664398e-06
layer 9: 4.956845194101334e-06
layer 10: 5.649109371006489e-06
layer 11: 6.595022976398468e-06
layer 12: 6.92228227853775e-06
layer 13: 7.3333755135536194e-06
layer 14: 7.672600448131561e-06
layer 15: 8.006669580936431e-06
layer 16: 8.94695520401001e-06
layer 17: 9.912904351949691e-06
layer 18: 1.0702745988965035e-05
layer 19: 1.2084681540727615e-05
layer 20: 1.3510849326848984e-05
layer 21: 1.4993250370025634e-05
layer 22: 1.5627190470695495e-05
layer 23: 1.9214315339922905e-05
layer 24: 1.9937701523303985e-05
layer 25: 2.1439727395772934e-05
layer 26: 2.1951720118522644e-05
layer 27: 2.3870080709457398e-05
layer 28: 2.5171246379613875e-05
layer 29: 2.614951878786087e-05
layer 30: 2.8442054986953734e-05
layer 31: 3.540612757205963e-05
layer 32: 1.0248859878629445e-05
logits: 1.5035882592201234e-05
Llama, FP16 (the expected 16-bit format to use)
layer 0: 0.0
layer 1: 0.000550079345703125
layer 2: 0.00298907470703125
layer 3: 0.0033966217041015625
layer 4: 0.0039486083984375
layer 5: 0.00466839599609375
layer 6: 0.00533612060546875
layer 7: 0.00594580078125
layer 8: 0.006715240478515625
layer 9: 0.00763134765625
layer 10: 0.008845230102539063
layer 11: 0.01030645751953125
layer 12: 0.011149169921875
layer 13: 0.011803375244140626
layer 14: 0.01296966552734375
layer 15: 0.013913818359375
layer 16: 0.015769287109375
layer 17: 0.01764404296875
layer 18: 0.01888623046875
layer 19: 0.02110791015625
layer 20: 0.023257568359375
layer 21: 0.025254150390625
layer 22: 0.02687548828125
layer 23: 0.03120947265625
layer 24: 0.032493896484375
layer 25: 0.03505859375
layer 26: 0.037328369140625
layer 27: 0.0409736328125
layer 28: 0.0434375
layer 29: 0.0456640625
layer 30: 0.04978125
layer 31: 0.060069580078125
layer 32: 0.015433685302734375
logits: 0.016572296142578127
Llama, BF16 (the wrong 16-bit format to use with Llama)
layer 0: 0.0
layer 1: 0.00433740234375
layer 2: 0.03967041015625
layer 3: 0.0434326171875
layer 4: 0.047635498046875
layer 5: 0.0537783203125
layer 6: 0.058983642578125
layer 7: 0.0638212890625
layer 8: 0.0715574951171875
layer 9: 0.0787001953125
layer 10: 0.0854931640625
layer 11: 0.09280859375
layer 12: 0.09901171875
layer 13: 0.107640625
layer 14: 0.11785498046875
layer 15: 0.1256083984375
layer 16: 0.1408369140625
layer 17: 0.156142578125
layer 18: 0.17044140625
layer 19: 0.191591796875
layer 20: 0.20652734375
layer 21: 0.2248125
layer 22: 0.239251953125
layer 23: 0.272525390625
layer 24: 0.2862265625
layer 25: 0.30887890625
layer 26: 0.329537109375
layer 27: 0.359927734375
layer 28: 0.3814072265625
layer 29: 0.400908203125
layer 30: 0.44475390625
layer 31: 0.5362109375
layer 32: 0.13218017578125
logits: 0.1447247314453125
GPT2, FP16
layer 0: 0.0
layer 1: 0.010214111328125
layer 2: 0.011416259765625
layer 3: 0.0163514404296875
layer 4: 0.0228807373046875
layer 5: 0.0232802734375
layer 6: 0.0260006103515625
layer 7: 0.02941253662109375
layer 8: 0.03486376953125
layer 9: 0.04135888671875
layer 10: 0.0513974609375
layer 11: 0.0786591796875
layer 12: 0.190262451171875
logits: 0.1796796875

As we can see:

  1. The error propagates (and increases) across layers
  2. Lower precisions greatly increase the mismatch between using KV cache or not
  3. BF16 is more sensitive to this difference than FP16 -- this is expected, as BF16 dedicates more bits to the exponent (and fewer to the mantissa), so rounding errors are larger
  4. This phenomenon also happens in battle-tested models like GPT2

What can we do about it?

First of all: the benefits of using lower-precision variables and KV caching are obvious. Are they worth the downsides? My advice is to measure the model on metrics relevant to your task (e.g. perplexity) and compare the cost-benefit for your use case. I suspect using KV caching will remain cost-effective :)
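
As an illustration of that kind of check, here is a minimal teacher-forced perplexity sketch (full-sequence forward pass, so it isolates the precision choice rather than the cache itself; the model/tokenizer arguments are assumed to be the ones loaded in the scripts above):

import torch

@torch.no_grad()
def perplexity(model, tokenizer, text: str) -> float:
    # Teacher-forced perplexity: passing labels == input_ids makes the model
    # return the mean next-token cross-entropy over the sequence.
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    loss = model(input_ids, labels=input_ids).loss
    return torch.exp(loss).item()

# e.g. compare perplexity(model_fp16, tokenizer, eval_text) against an fp32 run on
# the same texts to quantify what the lower precision costs on your own data.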

Secondly: there may be ways to reduce this mismatch, but so far I haven't found any. A common trick is to upcast some sensitive operations to FP32 (like the attention layers' softmax). For completeness, on Llama, I tried:

  1. Upcasting the Linear layers in the attention layer
  2. Running the whole attention layer in FP32
  3. Running apply_rotary_pos_emb in FP32 (while keeping sin and cos in FP32 as well)
  4. In the decoder layer, upcasting self.input_layernorm(hidden_states)
  5. In the decoder layer, upcasting self.post_attention_layernorm(hidden_states)

Most had no impact; some reduced the mismatch at a high throughput cost.
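
For reference, the softmax upcast mentioned above is the usual form of this trick inside an attention implementation (a generic sketch, not a patch to the Llama code; as noted, it did not remove the mismatch here):

import torch

def attention_probs(scores: torch.Tensor) -> torch.Tensor:
    # Compute the softmax in fp32 to limit rounding error, then cast back to the
    # working dtype (fp16/bf16) for the attention-times-values matmul.
    return torch.nn.functional.softmax(scores, dim=-1, dtype=torch.float32).to(scores.dtype)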

Finally, regarding left-padding: We might be able to mitigate problems here when we migrate batched generation to nested tensors, which don't need padding.


I hope this comprehensive analysis helps you understand what's going on 🤗 And, who knows, maybe it will be the spark that ignites a solution to this issue 🪄

@VictorSanh
Member

Thanks for the detailed explanation @gante! Makes a lot of sense!


@varung

varung commented Nov 24, 2023

> Hey folks 👋 I’ve done a deep dive on this issue, and I will link related issues to this comment, which attempts to summarize the findings. [...]

@gante why do you say that "bf16" is the wrong precision to use with LLAMA?

@gante
Member

gante commented Nov 28, 2023

@varung "wrong" is perhaps too strong of a word -- suboptimal would be more precise. We have collaborated with the authors of Llama 2, and they have suggested the use of fp16. You can see it in our examples, when we released the model (e.g. here).

In practice, it depends on how the model is saved -- we should load the model in the format in which it was stored. If it was stored in fp32 and you want to operate it in a 16-bit precision, fp16 is superior.
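
As a practical illustration, from_pretrained can read the precision recorded in the checkpoint so that the model is loaded in the format it was stored in (the model name below is just an example):

import torch
from transformers import AutoModelForCausalLM

# torch_dtype="auto" picks up the dtype stored with the checkpoint.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype="auto")
print(model.dtype)

# To force a specific 16-bit format instead (e.g. fp16, as suggested above):
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16)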

@jmzeng

jmzeng commented Nov 28, 2023

@gante Thanks for the explanation. I'm wondering if we would see problems when switching from a model trained in bf16 to fp16.

For example, we're using a version of the fine-tuned Llama 2 model, longchat v1.5, which seems to have been fine-tuned in bf16. In that case, would it be better to continue fine-tuning in fp16 or bf16? Moreover, would we see loss degradation from switching to fp16 after tuning in bf16? Thanks.

@gante
Member

gante commented Nov 30, 2023

Hey @jmzeng 👋

It's impossible to convert between fp16 and bf16 without rounding, which means that your model will lose performance once you switch. Switching before fine-tuning might be okay, depending on the model and how long your fine-tuning is -- you give the model a chance to recover from the rounding errors. However, switching before inference will be a source of distribution drift, which almost surely will negatively impact your downstream performance.

That being said, note that bf16 is indeed better for fine-tuning due to its wider dynamic range, and fp16 tends to excel at inference time due to its higher precision (more mantissa bits). So it's not an easy answer here :D

Finally, if you're using techniques like LoRA (see our peft library), you can get away with doing the fine-tuning in fp32. Then, you can downcast to fp16 with fewer problems.
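
A rough sketch of that LoRA workflow with peft (the target module names are a common choice for Llama-style models, not a prescription; treat this as an outline under those assumptions rather than a recipe):

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Keep the frozen base model and the trainable adapters in fp32 while fine-tuning.
base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b", torch_dtype=torch.float32)
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # common choice for Llama-style attention blocks
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()

# ... fine-tune with your trainer of choice ...

# Afterwards, merge the adapters back into the base weights and downcast once for inference.
merged = model.merge_and_unload()
merged = merged.half()  # fp16 for inference, per the discussion above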


github-actions bot commented Jan 1, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Jan 9, 2024