---
title: KV Caching
subtitle: Key-Value caching inference optimization for LLMs
description: test test test
date: "2025-06-06"
#date-modified: "2025-02-22"
#categories: [news]
bread-crumbs: true
back-to-top-navigation: true
toc: true
toc-depth: 3
#image: images/pizza-13601_256.gif
---


Transformers are used almost everywhere in natural language processing (e.g., ChatGPT, Bing).

As you generate more text, you use more GPU memory.

Why do Transformers require more memory when dealing with longer texts?

OpenAI's pricing hints at this: longer-context models are more expensive than shorter-context ones.
One reason is the high memory usage when handling large context lengths.

Much of that memory is taken up by the KV cache.


"You know how you wait a while for the first token to generate, but then the rest of it rips?"

## What is KV cache?

KV caching (Key-Value caching) is an inference optimization technique for [Transformer](transformer.ipynb)-based models, particularly large language models (LLMs), that increases inference speed and reduces computational cost.

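The technique is analyzed in detail in the 2023 paper "Efficiently Scaling Transformer Inference" (Pope et al.), although caching keys and values during decoding was already in use before that paper.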

KV caching eliminates redundant calculations during text generation by storing previously computed key and value tensors from the attention mechanism.
This makes the AI responses faster and more efficient.

This is the foundation of how modern LLMs can generate long outputs efficiently.

"KV" refers to the key and value tensors used in the attention mechanism.
These key (K) and value (V) states are needed to compute self-attention.
KV caching stores them for the tokens a Transformer-based language model has already processed.

## Motivation for KV Caching

KV caching is motivated by an inefficiency in how autoregressive Transformer-based models apply the attention mechanism.

Autoregressive Transformer-based language models like GPT generate text sequentially, one token at a time. 
During inference, the model processes the input sequence of previous tokens $[t_0, \dots, t_i]$ (e.g., "TODO ADD EXAMPLE HERE") to predict the next token $t_{i+1}$ (e.g., "TODO").
Then the model adds the generated token to the input sequence, creating a new input sequence of previous tokens $[t_0, \dots, t_{i+1}]$, and repeats the process until some stopping criterion is reached (e.g., generating an `<EOS>` token or reaching the maximum length).
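
The loop described above can be sketched in a few lines. This is only an illustration: `toy_next_token` is a hypothetical stand-in for a real language model (it just sums the token ids), and the token id `0` is assumed to play the role of `<EOS>`.

```python
def toy_next_token(tokens):
    """Hypothetical stand-in for a language model: returns the next token id."""
    # Purely illustrative rule: the next id is the sum of all previous ids modulo 10.
    return sum(tokens) % 10


def generate(prompt, max_new_tokens, eos_token=0):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = toy_next_token(tokens)  # the model sees the entire sequence so far
        tokens.append(next_token)            # the new token becomes part of the next input
        if next_token == eos_token:          # stopping criterion: <EOS> was generated
            break
    return tokens


print(generate([3, 4], max_new_tokens=5))  # [3, 4, 7, 4, 8, 6, 2]
```

Note that every call to `toy_next_token` receives the full sequence again, which is exactly the pattern that becomes expensive for a real model.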

![](images/autoregression.png)

This creates a fundamental inefficiency: imagine writing a sentence where, for each new word, you must re-read the entire text from the beginning to understand the context. While this might be manageable for short sentences, it becomes more and more inefficient as the text grows.

The extent of this inefficiency becomes clear when we look at how keys and values are computed in the decoder's masked self-attention.

In Transformer architectures, the attention mechanism processes input sequences by computing three matrices: Query (Q), Key (K), and Value (V).
$$
\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

![](images/attention.png)

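During decoding, these three matrices are not the same size: the query is a single vector for the current token, while the keys and values cover every token seen so far.
For each token, the input vector $x$ is multiplied by three different weight matrices $W$, which are learned from data:

- $q = W_q x$
- $k = W_k x$
- $v = W_v x$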
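
As a minimal sketch of these projections, the NumPy snippet below computes a single query vector for the current token and the key and value matrices for all tokens, then applies the attention formula. The dimensions (`d_model = 8`, `d_k = 4`, five tokens) and the random weights are assumptions chosen for illustration only; a real model uses learned weights, multiple heads, and multiple layers.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 4   # illustrative sizes (assumptions)
n_tokens = 5

# Projection matrices W_q, W_k, W_v (random here; learned in a real model)
W_q = rng.normal(size=(d_k, d_model))
W_k = rng.normal(size=(d_k, d_model))
W_v = rng.normal(size=(d_k, d_model))

X = rng.normal(size=(n_tokens, d_model))  # one row per token embedding
x = X[-1]                                 # embedding of the current (last) token

q = W_q @ x      # shape (d_k,): a single query vector
K = X @ W_k.T    # shape (n_tokens, d_k): one key per token
V = X @ W_v.T    # shape (n_tokens, d_k): one value per token

scores = K @ q / np.sqrt(d_k)             # one score per token
weights = np.exp(scores - scores.max())
weights /= weights.sum()                  # softmax over all tokens
output = weights @ V                      # weighted sum of the values

print(q.shape, K.shape, V.shape, output.shape)  # (4,) (5, 4) (5, 4) (4,)
```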


You can think of the dot product between $q$ and $K$ as computing attention between the current token that we care about and all of the previous tokens at the same time.

As we generate a sequence one token at a time, the K and V matrices change very little: each step only adds one new entry. A token corresponds to a column of $K^T$ and a row of $V$.

The crucial point is that once we have computed the key and value vectors for a token, they are not going to change again, no matter how many more tokens we generate.

Without a cache, however, the model still has to do the heavy work of recomputing the key and value vectors for this token on all subsequent steps.



This results in a number of matrix-vector multiplications that grows quadratically with the sequence length, which is very slow.
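
To make the quadratic growth concrete, the sketch below counts how many key projections (the same count applies to values) a single layer performs while generating `n_tokens` tokens. The function names are hypothetical helpers for this illustration; the count ignores the prompt and everything except the key projection itself.

```python
def kv_projections_without_cache(n_tokens):
    # At step i, keys are recomputed for all i tokens seen so far: 1 + 2 + ... + n
    return sum(range(1, n_tokens + 1))


def kv_projections_with_cache(n_tokens):
    # With a cache, each token's key is computed exactly once
    return n_tokens


for n in (10, 100, 1000):
    print(n, kv_projections_without_cache(n), kv_projections_with_cache(n))
# 10 55 10
# 100 5050 100
# 1000 500500 1000
```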

![](images/attention_matrices.png)
![](images/kv_cache_attention.png)



### Inefficiency and where KV caching comes into play

Recomputing the keys and values of all previous tokens at every step is inefficient.
KV caching is an optimization technique that mitigates this inefficiency.

Although Transformers are internally parallel, without a cache each new prediction requires a full forward pass over the whole sequence through all Transformer layers, which incurs compute and memory costs that grow quadratically with the sequence length.
This repetition also leads to computational redundancy.

## How does KV caching work?

https://www.dailydoseofds.com/p/kv-caching-in-llms-explained-visually/


- The decode phase generates a single token at each time step, but each token depends on the key and value tensors of all previous tokens (including the input tokens’ KV tensors computed at prefill, and any new KV tensors computed until the current time step). 
- To avoid recomputing all these tensors for all tokens at each time step, it’s possible to cache them in GPU memory.
- Every iteration, when new elements are computed, they are simply added to the running cache to be used in the next iteration. 
- In some implementations, there is one KV cache for each layer of the model.





KV caching works by storing the computed key and value tensors from previous tokens so that they can be reused at later steps:

1. After processing the initial prompt, we cache the computed keys ($K$) and values ($V$) for each layer.
2. During generation, we only compute $K$ and $V$ for the new token and append them to the cache.
3. We compute $Q$ for the current token and use it with the cached $K$ and $V$ to get the output.
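
The three steps can be sketched for a single attention head in a single layer. This is a simplified illustration under several assumptions: the weights are random, the token embeddings are given up front (a real model would produce them step by step), and the helper `attend` is a name introduced here. The final check confirms that attending with the cache gives the same result as recomputing all keys and values from scratch.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 4  # illustrative sizes (assumptions)
W_q, W_k, W_v = (rng.normal(size=(d_k, d_model)) for _ in range(3))


def attend(q, K, V):
    """Attention output for one query vector against all keys and values."""
    scores = K @ q / np.sqrt(d_k)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V


X = rng.normal(size=(6, d_model))  # embeddings: 4 prompt tokens + 2 generated tokens
prompt, new_tokens = X[:4], X[4:]

# Step 1 (prefill): compute and cache K and V for the prompt
K_cache = prompt @ W_k.T
V_cache = prompt @ W_v.T

cached_outputs = []
for x in new_tokens:
    # Step 2: project only the new token and append it to the cache
    K_cache = np.vstack([K_cache, W_k @ x])
    V_cache = np.vstack([V_cache, W_v @ x])
    # Step 3: query for the current token against the cached K and V
    cached_outputs.append(attend(W_q @ x, K_cache, V_cache))

# Reference: recompute K and V for the whole sequence at every step
full_outputs = []
for t in range(4, 6):
    seen = X[: t + 1]
    full_outputs.append(attend(W_q @ seen[-1], seen @ W_k.T, seen @ W_v.T))

print(np.allclose(cached_outputs, full_outputs))  # True
```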


![](images/key-value-caching.png)


This changes generation from full-sequence recomputation to a lightweight, incremental update, allowing the model to:

- Reuse computations: Previously calculated K and V matrices are stored in memory.
- Avoid redundant calculations: Only new tokens require fresh K and V computations.
- Maintain context: The cached values preserve the model's understanding of earlier context.
- Accelerate generation: Subsequent tokens generate much faster using cached data.


## Benefits, Limitations, and Trade-offs of KV Caching
With KV caching, the matrices involved in each decoding step are much smaller, because only the new token is projected. This leads to faster matrix multiplications.

- Speed enhancement: KV caching can substantially reduce inference time for longer sequences (the exact speedup depends on the model, hardware, and sequence length), as the model doesn't need to recompute the keys and values of previously processed tokens.


KV caching eliminates unnecessary computation during autoregressive generation, enabling faster and more efficient inference, especially for long sequences and real-time applications.

When to use KV caching: it becomes increasingly beneficial with longer conversations and documents, making it essential for applications like:

- (Multi-turn) chat interfaces with extended conversations
- Document analysis and summarization
- Code generation with large codebases


KV caching is a popular method for speeding up LLM inference, which helps make it possible to run LLMs on consumer hardware.


The main downside is that it needs more GPU VRAM (or CPU RAM if no GPU is used) to cache the key and value states.
KV caching is therefore a trade-off between speed and memory. Further drawbacks can include more complex code and restrictions on fancier inference schemes such as beam search.
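
To get a feel for the memory side of this trade-off, the cache size can be estimated as 2 (keys and values) × layers × heads × head dimension × sequence length × batch size × bytes per value. The helper `kv_cache_bytes` below is a hypothetical function written for this estimate; the example plugs in the GPT-2 small configuration (12 layers, 12 heads, head dimension 64, 1024-token context) and assumes 16-bit values.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size=1, bytes_per_value=2):
    """Rough KV cache size in bytes; ignores framework overhead and padding."""
    # Factor 2: one tensor for keys and one for values, per layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# GPT-2 small at full context length, 16-bit values (assumption)
size = kv_cache_bytes(n_layers=12, n_kv_heads=12, head_dim=64, seq_len=1024)
print(f"{size / 2**20:.0f} MiB")  # 36 MiB
```

The estimate grows linearly with both sequence length and batch size, which is why long contexts and large batches put pressure on memory.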


## Where is KV caching used?
KV caching is commonly used in the self-attention layers of decoder-only models.

KV caching takes place across multiple token generation steps and only happens in the decoder (i.e., in decoder-only models like GPT, or in the decoder part of encoder-decoder models like T5). Models like BERT are not generative and therefore do not use KV caching.

## KV cache implementation (in Python, PyTorch)

This is a simplified example of implementing KV caching in PyTorch:

In [None]:
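# Minimal KV cache sketch in PyTorch
import torch


class KVCache:
    """Stores the key and value tensors of all tokens processed so far."""

    def __init__(self):
        self.cache = {"key": None, "value": None}

    def update(self, key, value):
        # key, value: tensors assumed to have shape (batch, new_tokens, dim)
        if self.cache["key"] is None:
            # First call (prefill): store the tensors as they are
            self.cache["key"] = key
            self.cache["value"] = value
        else:
            # Later calls (decode): append along the sequence dimension (dim=1)
            self.cache["key"] = torch.cat([self.cache["key"], key], dim=1)
            self.cache["value"] = torch.cat([self.cache["value"], value], dim=1)

    def get_cache(self):
        return self.cache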


### Hugging Face `transformers` library 

Hugging Face Transformers includes built-in KV caching support for most Transformer models and manages cache creation and updates during generation.

When using the transformers library, this behavior is enabled by default through the `use_cache` parameter, as in the minimal example further below. Multiple caching strategies can also be selected through the `cache_implementation` parameter.

The cache implementations offer different trade-offs between speed and memory usage (availability depends on the installed library version):

- DynamicCache: Default, grows dynamically
- StaticCache: Fixed size, faster for known sequence lengths
- SinkCache: Keeps important tokens, evicts others
- SlidingWindowCache: Fixed window size
- OffloadedCache: CPU offloading for memory-constrained GPUs
- QuantizedCache: Reduces memory through quantization
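
To illustrate the idea behind a fixed-window strategy, here is a toy sketch in NumPy. It is not the Hugging Face implementation: `ToySlidingWindowCache` is a hypothetical class written for this example, and it only shows the core mechanism of evicting the oldest entries once the window is full.

```python
import numpy as np


class ToySlidingWindowCache:
    """Hypothetical sketch: keeps keys/values for only the most recent `window` tokens."""

    def __init__(self, window):
        self.window = window
        self.keys = None
        self.values = None

    def update(self, key, value):
        # key, value: arrays of shape (new_tokens, dim)
        if self.keys is None:
            self.keys, self.values = key, value
        else:
            self.keys = np.concatenate([self.keys, key], axis=0)
            self.values = np.concatenate([self.values, value], axis=0)
        # Evict the oldest entries so that at most `window` tokens remain
        self.keys = self.keys[-self.window:]
        self.values = self.values[-self.window:]
        return self.keys, self.values


cache = ToySlidingWindowCache(window=3)
for step in range(5):
    k = np.full((1, 2), float(step))  # dummy key/value that encodes the step index
    cache.update(k, k)

print(cache.keys[:, 0])  # [2. 3. 4.]
```

Memory stays bounded by the window size, at the cost of the model no longer attending to tokens that have been evicted.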

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Fall back to CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
checkpoint = "HuggingFaceTB/SmolLM2-1.7B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

inputs = tokenizer("The red cat was", return_tensors="pt").to(device)
output = model.generate(
    **inputs,
    max_new_tokens=300,
    use_cache=True,  # True by default
)
output_text = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
print(output_text)



## Inference speed comparison: KV caching vs. standard inference
Let's explore the inference speed difference with and without KV caching using the Hugging Face Transformers library.

- Speed of ??? without KV caching:
- Speed of ??? with KV caching:

In the reported results, the difference in inference speed was large while the additional GPU VRAM usage was negligible, so make sure to use KV caching in your Transformer model!

We benchmarked the code below with and without KV caching on a T4 GPU and got the following results:

In [None]:
import time

import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

# Tokenize once, outside the timed region
inputs = tokenizer("What is KV caching?", return_tensors="pt").to(device)

for use_cache in (True, False):
    times = []
    for _ in range(10):  # measure 10 generations
        start = time.perf_counter()
        model.generate(**inputs, use_cache=use_cache, max_new_tokens=1000)
        if device == "cuda":
            torch.cuda.synchronize()  # wait for pending GPU work before stopping the timer
        times.append(time.perf_counter() - start)
    print(f"{'with' if use_cache else 'without'} KV caching: {round(np.mean(times), 3)} +- {round(np.std(times), 3)} seconds")

with KV caching: 108.446 +- 10.526 seconds






## Where is the KV cache stored?
The KV cache is stored in memory: in GPU VRAM, or in CPU RAM if no GPU is used.

## Does ChatGPT use a cache, and how is the cache cleared?

## How big is the KV cache?

## What is KV cache used for?

## Can BERT use a KV cache?

- LLM, vLLM
- code



## References

- Pope, Reiner, et al. "Efficiently scaling transformer inference." Proceedings of Machine Learning and Systems 5 (2023): 606-624.
- https://huggingface.co/docs/transformers/main/en/kv_cache