# LLM KV Caching

When a transformer model generates text, it does not work with words directly. It works with tokens: small chunks of text such as a whole word, part of a word, or a single punctuation mark. Text is generated one token at a time, and without any optimization each new token forces the model to reprocess the entire sequence of tokens that came before it. The work already done for earlier tokens is repeated at every step, which wastes computation and slows generation down.

Key-Value (KV) Caching fixes this by storing the attention Keys and Values computed for past tokens so the model can reuse them instead of recalculating them from scratch.
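To make the idea concrete, here is a minimal pure-Python sketch of a single attention head, written for illustration only. The names `project`, `attend`, `step_without_cache`, and `step_with_cache`, the 2x2 weight matrices, and the toy token vectors are all assumptions made up for this example; this is not how the Transformers library implements its cache. It shows that appending only the newest token's Key and Value to a cache gives the same attention output as recomputing every Key and Value at each step.

```python
import math

def project(x, matrix):
    """Multiply vector x by a matrix stored as a list of rows."""
    return [sum(xi * row[j] for xi, row in zip(x, matrix))
            for j in range(len(matrix[0]))]

def attend(query, keys, values):
    """Scaled dot-product attention for one query over all keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w / total * value[d] for w, value in zip(weights, values))
            for d in range(len(values[0]))]

# Toy projection weights and token vectors (hypothetical values).
W_Q = [[1.0, 0.0], [0.5, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.5]]
W_V = [[1.0, 1.0], [0.0, 2.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0]]

def step_without_cache(prefix):
    """Recompute Keys and Values for every token in the prefix."""
    keys = [project(x, W_K) for x in prefix]
    values = [project(x, W_V) for x in prefix]
    return attend(project(prefix[-1], W_Q), keys, values)

def step_with_cache(x, cache):
    """Project only the newest token and append it to the cache."""
    cache["keys"].append(project(x, W_K))
    cache["values"].append(project(x, W_V))
    return attend(project(x, W_Q), cache["keys"], cache["values"])

cache = {"keys": [], "values": []}
uncached_outputs, cached_outputs = [], []
for t in range(1, len(tokens) + 1):
    uncached_outputs.append(step_without_cache(tokens[:t]))
    cached_outputs.append(step_with_cache(tokens[t - 1], cache))

same = all(math.isclose(a, b)
           for slow, fast in zip(uncached_outputs, cached_outputs)
           for a, b in zip(slow, fast))
print("Cached and uncached outputs match:", same)
```

The uncached path performs one Key projection and one Value projection per token in the prefix at every step, while the cached path performs exactly one of each per step.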

In this tutorial, we will build a simple experiment that compares text generation with and without KV caching, and measure the speed difference.

In [None]:
!pip install transformers torch --quiet
# Install the Hugging Face transformers library and PyTorch into the environment.
# --quiet suppresses most of pip's output.

# Loading the Model and Tokenizer

In this next cell, we import PyTorch and the Hugging Face classes we need. We choose a fairly lightweight model so that the experiment does not use too many computational resources but still proves the point. We then load two things: the tokenizer, which converts raw text into tokens (numeric IDs the model can understand) and later turns tokens back into human-readable text, and the language model, which is the neural network itself, trained to predict the next token in a sequence and generate coherent text. If a GPU is available, we run the model on it to speed things up further.

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# We'll use a small GPT-2 variant so it's fast to run
MODEL_NAME = "distilgpt2"
# We'll get a message in the output about using a token here
# but it is a public model so we can ignore it

# Load the tokenizer (turns text into tokens/IDs)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Load the language model
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Put the model on GPU if available (much faster)
if torch.cuda.is_available():
    model = model.to("cuda")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



# Define Prompt and Helper Function

In this cell, we set up our experiment. First, we define a short prompt for the model to continue and specify how many new tokens to generate. Then we create a helper function, `generate_and_time()`, that takes a flag (`use_cache=True` or `use_cache=False`) and performs one complete generation run. Inside, the function tokenizes the prompt into tensors, which are multi-dimensional arrays holding the numerical data the model works with.

These tensors are moved to the GPU if one is available, and the function times how long generation takes, synchronizing with the GPU before starting and after stopping the timer so that the measurement covers the work actually done. The crucial line is `use_cache=use_cache` inside `model.generate()`: it controls whether the model stores and reuses past Keys and Values or recomputes them from scratch at every step. Finally, the function decodes the tokens back into text and returns both the elapsed time and the generated output.

In [None]:
import time

# The prompt we'll feed into the model
PROMPT = "Once upon a time, in a land far away"

# How many new tokens to generate
MAX_NEW_TOKENS = 50

def generate_and_time(use_cache: bool):
    """Generate text with or without KV caching and measure time taken."""

    # Convert the prompt into model input tensors
    inputs = tokenizer(PROMPT, return_tensors="pt")
    if torch.cuda.is_available():
        # Move inputs to GPU if available
        inputs = {k: v.to("cuda") for k, v in inputs.items()}

    # GPU work is asynchronous, so wait for pending work before starting the timer
    if torch.cuda.is_available():
        torch.cuda.synchronize()

    # perf_counter is a monotonic, high-resolution clock intended for timing
    start = time.perf_counter()

    # Generate text
    outputs = model.generate(
        **inputs,
        max_new_tokens=MAX_NEW_TOKENS,
        use_cache=use_cache,  # <--- KV caching flag
    )

    # Synchronize again so the timer stops only after generation has finished
    if torch.cuda.is_available():
        torch.cuda.synchronize()

    elapsed = time.perf_counter() - start

    # Decode back to text
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return elapsed, text
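
A single timed run can be noisy, since the first call often includes one-time setup costs. One common way to reduce that noise is to do an untimed warm-up run and then take the median of several timed runs. The sketch below is an optional addition, not part of the original experiment; `time_call` is a hypothetical helper name, and the workload passed to it is a stand-in so the example runs on its own.

```python
import statistics
import time

def time_call(fn, repeats=5, warmup=1):
    """Return the median wall-clock time of fn() over several runs."""
    for _ in range(warmup):
        fn()  # untimed warm-up run
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Stand-in workload so the sketch is self-contained.
median_seconds = time_call(lambda: sum(range(100_000)))
print(f"Median: {median_seconds:.6f} seconds")
```

In this notebook, the same idea could be applied to the helper defined above by collecting the elapsed time it returns over several calls, for example `statistics.median(generate_and_time(use_cache=True)[0] for _ in range(5))`.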


# Running Without KV Caching
Here we disable caching and let the model recompute everything at every generation step.

In [None]:
t_no_cache, out_no_cache = generate_and_time(use_cache=False)
# use_cache=False means KV caching is not used

print("Without KV Caching:")
print(f"Time: {t_no_cache:.2f} seconds")  # time taken to generate
print(f"Output: {out_no_cache[:300]}")    # first 300 characters of the generated text

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Without KV Caching:
Time: 4.21 seconds
Output: Once upon a time, in a land far away from the sun, the sun is shining.
...


# Running With KV Caching
Now we enable caching. Notice how much faster the model runs, especially as the number of generated tokens increases. The first step still requires a full pass over the prompt, but after that the cached Keys and Values mean each step only has to process the newest token.
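
As a rough back-of-the-envelope illustration, the sketch below counts how many token positions are fed through the model across all generation steps in each mode. It assumes the prompt is 10 tokens long (an assumption, not a measured value) and that all 50 new tokens are generated; `positions_processed` is a hypothetical helper written for this example. It counts input positions only, not actual floating-point operations, since with caching each step still attends over every cached Key and Value.

```python
def positions_processed(prompt_len, new_tokens, use_cache):
    """Count token positions fed through the model across all steps."""
    if new_tokens == 0:
        return 0
    if use_cache:
        # One full pass over the prompt, then one new token per remaining step.
        return prompt_len + (new_tokens - 1)
    # Each step re-reads the prompt plus everything generated so far.
    return sum(prompt_len + step for step in range(new_tokens))

PROMPT_LEN = 10   # assumed prompt length in tokens
NEW_TOKENS = 50   # matches MAX_NEW_TOKENS in this tutorial

without_cache = positions_processed(PROMPT_LEN, NEW_TOKENS, use_cache=False)
with_cache = positions_processed(PROMPT_LEN, NEW_TOKENS, use_cache=True)
print(f"Without cache: {without_cache} positions")  # 1725
print(f"With cache:    {with_cache} positions")     # 59
```

The uncached count grows quadratically with the number of generated tokens, while the cached count grows linearly, which is why the gap widens for longer generations.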

In [None]:
t_cache, out_cache = generate_and_time(use_cache=True)
# use_cache=True means KV caching is used

print("With KV Caching")
print(f"Time: {t_cache:.2f} seconds")
print(f"Output: {out_cache[:300]}")
# Same reporting as in the previous cell

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


With KV Caching
Time: 1.46 seconds
Output: Once upon a time, in a land far away from the sun, the sun is shining.
...
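
To put a number on the difference, the two timings reported above can be turned into a speedup factor. The values below are copied from the outputs of this particular run and will differ on other hardware; the variable names `reported_no_cache` and `reported_cache` are chosen for this sketch so they do not overwrite the notebook's own variables.

```python
reported_no_cache = 4.21  # seconds, from the run without KV caching
reported_cache = 1.46     # seconds, from the run with KV caching

speedup = reported_no_cache / reported_cache
saved = reported_no_cache - reported_cache
print(f"Speedup: {speedup:.2f}x")
print(f"Time saved: {saved:.2f} seconds")
```

In a live notebook session, the same calculation could use `t_no_cache / t_cache` directly, and `out_no_cache == out_cache` would check whether the two generated texts are identical.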


# Conclusion

As this experiment shows, KV Caching gives a significant reduction in computation even at a small scale: generation time fell from 4.21 seconds to 1.46 seconds (roughly 2.9x faster) for a (near) identical output. It is easy to see how much this helps reduce computational costs for larger-scale projects and ou